Taking the aggravation out of data aggregation: A conceptual guide to dealing with statistical issues related to the pooling of individual‐level observational data


Field data often include multiple observations taken from the same individual. To avoid pseudoreplication, it is commonplace to aggregate data, generating a mean score per individual and then using these aggregated data in subsequent analyses. Aggregation, however, can generate problems of its own. Not only does it lead to a loss of information, it can also leave analyses vulnerable to the “ecological fallacy”: the drawing of false inferences about individual behavior on the basis of population-level (“ecological”) data. It can also result in Simpson's paradox, where relationships seen at the individual level can be completely reversed when analyzed at the aggregate level. These phenomena have been documented widely in the medical and social sciences but tend to go unremarked in primatological studies that rely on observational data from the field. Here, we provide a conceptual guide that explains how and why aggregate data are vulnerable to the ecological fallacy and Simpson's paradox, illustrating these points using data on baboons. We then discuss one particular analytical approach, namely multi‐level modeling, that can potentially eliminate these problems. By highlighting the issue of the ecological fallacy, and increasing awareness of how datasets are often organized into a number of different levels, we also show how researchers can more positively exploit the structure of their datasets, without any information loss. These analytical approaches may thus provide greater insight into behavior by permitting more thorough investigation of interactions and cross‐level effects.
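
The reversal described above can be made concrete with a small numerical sketch. The data below are invented purely for illustration (they are not the baboon data, and the individual labels and the helper `ols_slope` are hypothetical): each of three individuals shows a negative relationship between x and y across its own repeated observations, yet the per-individual means show a positive one.

```python
from statistics import fmean

# Hypothetical repeated observations (x, y) for three individuals.
# These values are made up for illustration only.
observations = {
    "A": [(1, 5), (2, 4), (3, 3)],
    "B": [(4, 8), (5, 7), (6, 6)],
    "C": [(7, 11), (8, 10), (9, 9)],
}


def ols_slope(points):
    """Ordinary least-squares slope of y on x for a list of (x, y) pairs."""
    mean_x = fmean(x for x, _ in points)
    mean_y = fmean(y for _, y in points)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    return covariance / variance


# Individual level: slope within each individual's own observations.
within_slopes = {name: ols_slope(pts) for name, pts in observations.items()}

# Aggregate level: one mean (x, y) point per individual, then a single slope.
individual_means = [
    (fmean(x for x, _ in pts), fmean(y for _, y in pts))
    for pts in observations.values()
]
between_slope = ols_slope(individual_means)

print("within-individual slopes:", within_slopes)
print("slope across individual means:", between_slope)
```

In this toy example every within-individual slope is negative while the slope across the aggregated means is positive, so an analysis of the means alone would suggest the opposite of what each individual actually does. A multi‐level model is one way to estimate both levels at once rather than discarding the within-individual variation.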

In American Journal of Primatology