In 1939, E. L. Thorndike gave an example of what eventually became known as the ecological fallacy. Here is his example.

ASKurz/ecological_fallacy_Thorndike

The ecological fallacy: Thorndike’s (1939) example

The ecological fallacy occurs when results from group-level analyses are wrongly extended to individuals. The fallacy is often attributed to William S. Robinson’s (1950) paper, Ecological correlations and the behavior of individuals, and the name itself first appeared in sociologist Hanan C. Selvin’s (1958) paper, Durkheim’s suicide and problems of empirical research. However, my fellow psychologists might be happy to learn the idea goes back at least as far as E. L. Thorndike’s (1939) paper, On the fallacy of imputing the correlations found for groups to the individuals or smaller groups composing them. The purpose of this project is to walk through Thorndike’s original examples.

Here’s how Thorndike began his paper:

If the correlation between two traits, A and B (say, poverty and delinquency), in n groups (say, the residents of w districts) has a certain value, K, the correlation between A and B in the individuals or the families composing the groups need not be K and will not be, save in very special circumstances. (p. 122)

Thorndike referred to an example of this mistake from the literature of his time, which we won’t get into here. He then worked through the fallacy with 12 simulated data sets. We’ll cover those in detail. In each of his synthetic data sets, “A is supposed to be intelligence quotient and B is supposed to be the fraction of a room or number of rooms per person” (p. 124). That is, he was examining the correlation between IQ and the crowdedness of one’s living conditions.

Here we’ll copy down the data into a series of tribbles (see Müller & Wickham, 2020; also Wickham, 2019; Wickham et al., 2019).

library(tidyverse)

# table I
t1 <-
  tribble(
  ~a, ~b, ~n,
  -1, -3, 1,
  0, -3, 1,
  1, -3, 1,
  
  -3, -2, 1,
  -2, -2, 1,
  -1, -2, 4,
  0, -2, 6,
  1, -2, 4,
  2, -2, 1,
  3, -2, 1,
  
  -4, -1, 1,
  -3, -1, 2,
  -2, -1, 4,
  -1, -1, 7,
  0, -1, 15,
  1, -1, 7,
  2, -1, 4,
  3, -1, 2,
  4, -1, 1,
  
  -4, 0, 2,
  -3, 0, 2,
  -2, 0, 6,
  -1, 0, 8,
  0, 0, 20,
  1, 0, 8,
  2, 0, 6,
  3, 0, 2,
  4, 0, 2,
  
  -4, 1, 1,
  -3, 1, 2,
  -2, 1, 4,
  -1, 1, 7,
  0, 1, 15,
  1, 1, 7,
  2, 1, 4,
  3, 1, 2,
  4, 1, 1,
  
  -3, 2, 1,
  -2, 2, 1,
  -1, 2, 4,
  0, 2, 6,
  1, 2, 4,
  2, 2, 1,
  3, 2, 1,
  
  -1, 3, 1,
  0, 3, 1,
  1, 3, 1
) 

# table II

t2 <-
  tribble(
  ~a, ~b, ~n,
  -2, -3, 1,
  -1, -3, 1,
  0, -3, 1,
  1, -3, 1,
  
  -3, -2, 1,
  -2, -2, 2,
  -1, -2, 4,
  0, -2, 4,
  1, -2, 2,
  2, -2, 1,
  
  -3, -1, 1,
  -2, -1, 3,
  -1, -1, 6,
  0, -1, 6,
  1, -1, 3,
  2, -1, 1,
  
  -3, 0, 1,
  -2, 0, 2,
  -1, 0, 4,
  0, 0, 4,
  1, 0, 2,
  2, 0, 1,
  
  -2, 1, 1,
  -1, 1, 1,
  0, 1, 1,
  1, 1, 1,
) 

# table III

t3 <-
  tribble(
  ~a, ~b, ~n,
  -1, -2, 1,
  0, -2, 1,
  1, -2, 1,
  2, -2, 1,
  
  -3, -1, 1,
  -2, -1, 1,
  -1, -1, 2,
  0, -1, 5,
  1, -1, 5,
  2, -1, 2,
  3, -1, 1,
  
  -3, 0, 1,
  -2, 0, 3,
  -1, 0, 5,
  0, 0, 10,
  1, 0, 10,
  2, 0, 5,
  3, 0, 3,
  4, 0, 2,
  
  -3, 1, 1,
  -2, 1, 3,
  -1, 1, 5,
  0, 1, 10,
  1, 1, 10,
  2, 1, 5,
  3, 1, 3,
  4, 1, 2,
  
  -3, 2, 1,
  -2, 2, 1,
  -1, 2, 2,
  0, 2, 5,
  1, 2, 5,
  2, 2, 2,
  3, 2, 1,
  
  -1, 3, 1,
  0, 3, 1,
  1, 3, 1,
  2, 3, 1
) 

# table IV

t4 <-
  tribble(
  ~a, ~b, ~n,
  -3, -3, 1,
  -2, -3, 1,
  -1, -3, 1,
  0, -3, 1,
  
  -4, -2, 1,
  -3, -2, 2,
  -2, -2, 4,
  -1, -2, 4,
  0, -2, 2,
  1, -2, 1,
  
  -4, -1, 1,
  -3, -1, 3,
  -2, -1, 6,
  -1, -1, 6,
  0, -1, 3,
  1, -1, 1,
  
  -4, 0, 1,
  -3, 0, 2,
  -2, 0, 4,
  -1, 0, 4,
  0, 0, 2,
  1, 0, 1,
  
  -3, 1, 1,
  -2, 1, 1,
  -1, 1, 1,
  0, 1, 1,
) 

# table V

t5 <-
  tribble(
  ~a, ~b, ~n,
  -3, -4, 1,
  -2, -4, 1,
  -1, -4, 1,
  
  -4, -3, 1,
  -3, -3, 2,
  -2, -3, 4,
  -1, -3, 2,
  0, -3, 1,
  
  -4, -2, 1,
  -3, -2, 3,
  -2, -2, 6,
  -1, -2, 3,
  0, -2, 1,
  
  -4, -1, 1,
  -3, -1, 2,
  -2, -1, 4,
  -1, -1, 2,
  0, -1, 1,
  
  -3, 0, 1,
  -2, 0, 1,
  -1, 0, 1
) 

# table VI

t6 <-
  tribble(
  ~a, ~b, ~n,
  -2, -4, 1,
  -1, -4, 1,
  0, -4, 1,
  
  -3, -3, 1,
  -2, -3, 2,
  -1, -3, 4,
  0, -3, 2,
  1, -3, 1,
  
  -3, -2, 1,
  -2, -2, 3,
  -1, -2, 6,
  0, -2, 3,
  1, -2, 1,
  
  -3, -1, 1,
  -2, -1, 2,
  -1, -1, 4,
  0, -1, 2,
  1, -1, 1,
  
  -2, 0, 1,
  -1, 0, 1,
  0, 0, 1
) 

# table VII

t7 <-
  tribble(
  ~a, ~b, ~n,
  -2, -3, 1,
  -1, -3, 1,
  0, -3, 1,
  
  -3, -2, 1,
  -2, -2, 2,
  -1, -2, 4,
  0, -2, 2,
  1, -2, 1,
  
  -3, -1, 1,
  -2, -1, 3,
  -1, -1, 6,
  0, -1, 3,
  1, -1, 1,
  
  -3, 0, 1,
  -2, 0, 2,
  -1, 0, 4,
  0, 0, 2,
  1, 0, 1,
  
  -2, 1, 1,
  -1, 1, 1,
  0, 1, 1
) 

# table VIII

t8 <-
  tribble(
  ~a, ~b, ~n,
  -1, -4, 1,
  0, -4, 1,
  1, -4, 1,
  
  -2, -3, 1,
  -1, -3, 2,
  0, -3, 4,
  1, -3, 2,
  2, -3, 1,
  
  -2, -2, 1,
  -1, -2, 3,
  0, -2, 6,
  1, -2, 3,
  2, -2, 1,
  
  -2, -1, 1,
  -1, -1, 2,
  0, -1, 4,
  1, -1, 2,
  2, -1, 1,
  
  -1, 0, 1,
  0, 0, 1,
  1, 0, 1
) 

# table IX

t9 <-
  tribble(
  ~a, ~b, ~n,
  -1, -2, 1,
  0, -2, 1,
  1, -2, 1,
  
  -2, -1, 1,
  -1, -1, 2,
  0, -1, 4,
  1, -1, 2,
  2, -1, 1,
  
  -2, 0, 1,
  -1, 0, 3,
  0, 0, 6,
  1, 0, 3,
  2, 0, 1,
  
  -2, 1, 1,
  -1, 1, 2,
  0, 1, 4,
  1, 1, 2,
  2, 1, 1,
  
  -1, 2, 1,
  0, 2, 1,
  1, 2, 1
) 

# table X

t10 <-
  tribble(
  ~a, ~b, ~n,
  0, -1, 1,
  1, -1, 1,
  2, -1, 1,
  
  -1, 0, 1,
  0, 0, 2,
  1, 0, 4,
  2, 0, 2,
  3, 0, 1,
  
  -1, 1, 1,
  0, 1, 3,
  1, 1, 6,
  2, 1, 3,
  3, 1, 1,
  
  -1, 2, 1,
  0, 2, 2,
  1, 2, 4,
  2, 2, 2,
  3, 2, 1,
  
  0, 3, 1,
  1, 3, 1,
  2, 3, 1
) 

# table XI

t11 <-
  tribble(
  ~a, ~b, ~n,
  1, 0, 1,
  2, 0, 1,
  3, 0, 1,
  
  0, 1, 1,
  1, 1, 2,
  2, 1, 4,
  3, 1, 2,
  4, 1, 1,
  
  0, 2, 1,
  1, 2, 3,
  2, 2, 6,
  3, 2, 3,
  4, 2, 1,
  
  0, 3, 1,
  1, 3, 2,
  2, 3, 4,
  3, 3, 2,
  4, 3, 1,
  
  1, 4, 1,
  2, 4, 1,
  3, 4, 1
) 

# table XII

t12 <-
  tribble(
  ~a, ~b, ~n,
  2, 1, 1,
  3, 1, 1,
  4, 1, 1,
  
  1, 2, 1,
  2, 2, 2,
  3, 2, 4,
  4, 2, 2,
  5, 2, 1,
  
  1, 3, 1,
  2, 3, 3,
  3, 3, 6,
  4, 3, 3,
  5, 3, 1,
  
  1, 4, 1,
  2, 4, 2,
  3, 4, 4,
  4, 4, 2,
  5, 4, 1,
  
  2, 5, 1,
  3, 5, 1,
  4, 5, 1
) 

We’ll combine them here, using t to index the table each row came from.

t13 <-
  bind_rows(
  t1  %>% mutate(t = 1),
  t2  %>% mutate(t = 2),
  t3  %>% mutate(t = 3),
  t4  %>% mutate(t = 4),
  t5  %>% mutate(t = 5),
  t6  %>% mutate(t = 6),
  t7  %>% mutate(t = 7),
  t8  %>% mutate(t = 8),
  t9  %>% mutate(t = 9),
  t10 %>% mutate(t = 10),
  t11 %>% mutate(t = 11),
  t12 %>% mutate(t = 12)
) 

In Thorndike’s example, the data from each of his 12 tables corresponded to “scores for sample persons in each of twelve districts into which a city is divided” (p. 124). For simplicity, he gave the values in a standardized (i.e., z-score) metric. Here’s a plot of all those data, disaggregated by district. It corresponds directly to the way Thorndike presented the tables in his article.

t13 %>% 
  mutate(table = t) %>% 
  
  ggplot(aes(x = a, y = b)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = n, color = n < 11),
            show.legend = FALSE) +
  scale_fill_viridis_c(expression(italic(n)), option = "C", limits = c(0, NA)) +
  scale_color_manual(values = c("black", "white")) +
  scale_x_continuous(breaks = -4:4, position = "top") +
  scale_y_continuous(breaks = -4:4, trans = "reverse") +
  theme(panel.grid = element_blank()) +
  facet_wrap(~table, ncol = 2, labeller = label_both)

Thorndike asserted that “within each of the districts the correlation between A and B is zero” (p. 124). Let’s check.

t13 %>% 
  uncount(n) %>% 
  group_by(t) %>% 
  summarise(r = cor(a, b))
## # A tibble: 12 x 2
##        t     r
##    <dbl> <dbl>
##  1     1     0
##  2     2     0
##  3     3     0
##  4     4     0
##  5     5     0
##  6     6     0
##  7     7     0
##  8     8     0
##  9     9     0
## 10    10     0
## 11    11     0
## 12    12     0
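
The `uncount(n)` step expands each cell into `n` identical rows before calling `cor()`. As a cross-check, the same correlation can be computed straight from the cell counts by treating `n` as frequency weights. The sketch below uses base R only and re-enters Table IX (the `t9` data above) as plain vectors; `weighted_cor()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Table IX re-entered as vectors (same cells and counts as t9 above)
a <- c(-1:1, -2:2, -2:2, -2:2, -1:1)
b <- rep(-2:2, times = c(3, 5, 5, 5, 3))
n <- c(1, 1, 1,
       1, 2, 4, 2, 1,
       1, 3, 6, 3, 1,
       1, 2, 4, 2, 1,
       1, 1, 1)

# Hypothetical helper: Pearson correlation with frequency weights w
weighted_cor <- function(x, y, w) {
  mx <- weighted.mean(x, w)
  my <- weighted.mean(y, w)
  sxy <- sum(w * (x - mx) * (y - my))
  sxx <- sum(w * (x - mx)^2)
  syy <- sum(w * (y - my)^2)
  sxy / sqrt(sxx * syy)
}

# Route 1: weights applied directly to the cells
r_weighted <- weighted_cor(a, b, n)

# Route 2: expand each cell n times, the base R analogue of uncount(n)
r_expanded <- cor(rep(a, n), rep(b, n))

c(weighted = r_weighted, expanded = r_expanded)
```

Both routes should agree with the zero reported for district 9 in the output above.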

Yep. The correlation is zero within every district. Here’s what the data look like if you naïvely combine the numbers from all districts.

t13 %>% 
  uncount(n) %>% 
  group_by(a, b) %>% 
  count() %>% 
  
  ggplot(aes(x = a, y = b)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = n, color = n < 20),
            show.legend = FALSE) +
  scale_fill_viridis_c(expression(italic(n)), option = "C", limits = c(0, NA)) +
  scale_color_manual(values = c("black", "white")) +
  scale_x_continuous(breaks = -4:5, position = "top") +
  scale_y_continuous(breaks = -4:5, trans = "reverse") +
  theme(panel.grid = element_blank())

This corresponds to Thorndike’s Table XIII. If you look closely at his original, you’ll see he miscounted a few cells. No worries; the overall findings are still sound. Here’s the bivariate correlation for the naïvely combined data.

t13 %>% 
  uncount(n) %>% 
  summarise(r = cor(a, b))
## # A tibble: 1 x 1
##       r
##   <dbl>
## 1 0.454
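
Why is the pooled correlation positive when every district has a correlation of zero? One way to see it is the law of total covariance: the pooled covariance is the sum of the average within-district covariance and the covariance of the district means. This is a standard identity, added here as an aside. The base R sketch below checks it on a small hypothetical data set (two made-up districts, not Thorndike's tables); `pcov()` and `cov_decomp()` are illustrative helper names.

```r
# Population (divide-by-N) covariance, so the two pieces add up exactly
pcov <- function(x, y) mean((x - mean(x)) * (y - mean(y)))

# Hypothetical helper: split the pooled covariance of a and b by group g
cov_decomp <- function(a, b, g) {
  ma <- ave(a, g)  # each row's group mean of a
  mb <- ave(b, g)  # each row's group mean of b
  c(within  = mean((a - ma) * (b - mb)),  # average covariance inside groups
    between = pcov(ma, mb),               # covariance of the group means
    total   = pcov(a, b))                 # covariance ignoring groups
}

# Made-up data: two districts, each with zero covariance inside
a <- c(-2, -2,  0, 0,   1, 1, 3, 3)
b <- c(-1,  1, -1, 1,   2, 4, 2, 4)
g <- rep(1:2, each = 4)

d <- cov_decomp(a, b, g)
d
# within = 0, between = 2.25, total = 2.25
```

With zero covariance inside each group, all of the pooled covariance comes from the group means sitting in different places.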

But what if we look at aggregated data? To do so, we’ll take the mean of a and b from each of the districts. Here’s what that plot might look like.

t13 %>%
  uncount(n) %>% 
  group_by(t) %>% 
  summarise(a = mean(a),
            b = mean(b)) %>% 
  count(a, b) %>% 
  
  ggplot(aes(x = a, y = b)) +
  geom_tile(aes(fill = n)) +
  geom_text(aes(label = n, color = n < 2),
            show.legend = FALSE) +
  scale_fill_viridis_c(expression(italic(n)), option = "C", breaks = 1:2) +
  scale_color_manual(values = c("black", "white")) +
  scale_x_continuous(breaks = -4:5, position = "top") +
  scale_y_continuous(breaks = -4:5, trans = "reverse") +
  theme(panel.grid = element_blank())

That plot corresponds directly to Thorndike’s Table XIV. Here’s the correlation.

t13 %>%
  uncount(n) %>% 
  group_by(t) %>% 
  summarise(mu_a = mean(a),
            mu_b = mean(b)) %>% 
  summarise(r = cor(mu_a, mu_b))
## # A tibble: 1 x 1
##       r
##   <dbl>
## 1 0.914

To recap, within each district the correlation is exactly 0. When we naïvely combine all the data, the correlation is about .45. When we aggregate the data, the correlation goes to .91. We might as well finish right where we started.

If the correlation between two traits, A and B (say, poverty and delinquency), in n groups (say, the residents of w districts) has a certain value, K, the correlation between A and B in the individuals or the families composing the groups need not be K and will not be, save in very special circumstances. (p. 122)

That is, the results from group-level data will not necessarily correspond to the results of subgroup- or individual-level data. Proceed with caution. To learn more about this phenomenon, see the chapter by Hamaker (2012) or the paper by Kievit et al. (2013).
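
To see the same pattern in miniature, here is a sketch with hypothetical toy data (three made-up districts of four people each, not Thorndike's numbers), using base R only. Within each district the points form a square, so the correlation is zero; the district centers, however, trend upward.

```r
# Made-up district centers (ca, cb) and a square of offsets around each
centers <- data.frame(district = 1:3,
                      ca = c(-2, 0, 2),
                      cb = c(-2, 1, 1))
offsets <- expand.grid(da = c(-1, 1), db = c(-1, 1))

# No shared columns, so merge() returns the cross join: 3 x 4 = 12 rows
toy <- merge(centers, offsets)
toy$a <- toy$ca + toy$da
toy$b <- toy$cb + toy$db

# 1. Correlation within each district
r_within <- sapply(split(toy, toy$district), function(d) cor(d$a, d$b))

# 2. Correlation after pooling everyone, ignoring district
r_pooled <- cor(toy$a, toy$b)

# 3. Correlation of the district means (the aggregated data)
means <- aggregate(cbind(a, b) ~ district, data = toy, FUN = mean)
r_between <- cor(means$a, means$b)

r_within   # zero in all three districts
r_pooled   # about 0.60
r_between  # about 0.87
```

The three levels give three different answers, which is the same qualitative pattern as the 0, .45, and .91 from Thorndike's tables.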

Session info

sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.0   stringr_1.4.0   dplyr_1.0.1     purrr_0.3.4    
## [5] readr_1.3.1     tidyr_1.1.1     tibble_3.0.3    ggplot2_3.3.2  
## [9] tidyverse_1.3.0
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.13         haven_2.2.0       lattice_0.20-38  
##  [5] colorspace_1.4-1  vctrs_0.3.4       generics_0.0.2    htmltools_0.5.0  
##  [9] viridisLite_0.3.0 yaml_2.2.1        utf8_1.1.4        rlang_0.4.7      
## [13] pillar_1.4.6      glue_1.4.2        withr_2.2.0       DBI_1.1.0        
## [17] dbplyr_1.4.2      modelr_0.1.6      readxl_1.3.1      lifecycle_0.2.0  
## [21] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0  rvest_0.3.5      
## [25] evaluate_0.14     labeling_0.3      knitr_1.28        fansi_0.4.1      
## [29] broom_0.5.5       Rcpp_1.0.5        scales_1.1.1      backports_1.1.9  
## [33] jsonlite_1.7.0    farver_2.0.3      fs_1.4.1          hms_0.5.3        
## [37] digest_0.6.25     stringi_1.4.6     grid_3.6.3        cli_2.0.2        
## [41] tools_3.6.3       magrittr_1.5      crayon_1.3.4      pkgconfig_2.0.3  
## [45] ellipsis_0.3.1    xml2_1.3.1        reprex_0.3.0      lubridate_1.7.8  
## [49] assertthat_0.2.1  rmarkdown_2.1     httr_1.4.1        rstudioapi_0.11  
## [53] R6_2.4.1          nlme_3.1-144      compiler_3.6.3

References

Hamaker, E. L. (2012). Why researchers should think "within-person": A paradigmatic rationale. In Handbook of research methods for studying daily life (pp. 43–61). The Guilford Press. https://www.guilford.com/books/Handbook-of-Research-Methods-for-Studying-Daily-Life/Mehl-Conner/9781462513055

Kievit, R., Frankenhuis, W. E., Waldorp, L., & Borsboom, D. (2013). Simpson’s paradox in psychological science: A practical guide. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00513

Müller, K., & Wickham, H. (2020). tibble: Simple data frames. https://CRAN.R-project.org/package=tibble

Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3). https://doi.org/10.2307/2087176

Selvin, H. C. (1958). Durkheim’s suicide and problems of empirical research. American Journal of Sociology, 63(6), 607–619. https://doi.org/10.1086/222356

Thorndike, E. L. (1939). On the fallacy of imputing the correlations found for groups to the individuals or smaller groups composing them. The American Journal of Psychology, 52(1), 122–124. https://doi.org/10.2307/1416673

Wickham, H. (2019). tidyverse: Easily install and load the ’tidyverse’. https://CRAN.R-project.org/package=tidyverse

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
