Frequency tests

Problem

You have categorical data and want test whether the frequency distribution of values differs from an expected frequency, or if the distribution differs between groups.

Solution

There are two general questions that frequency tests address:

  • Does the frequency distribution differ from an expected, or theoretical, distribution (e.g., expect 50% yes, 50% no)? (Goodness-of-fit test)
  • Does the frequency distribution differ between two or more groups? (Independence test) Of the statistical tests commonly used to address these questions, there are exact tests and approximate tests.
Expected distribution Comparing groups
Exact Exact binomial Fisher’s exact
Approximate Chi-square goodness of fit Chi-square test of independence

Note: The exact binomial test can only be used on one variable that has two levels. Fisher’s exact test can only be used with two-dimensional contingency tables (for example, it can be used when there is one independent variable and one dependent variable, but not when there are 2 IVs and 1 DV).

To test for paired or within-subjects effects, McNemar’s test can be used. To use it, there must be one IV with two levels, and one DV with two levels.

To test for the independence of two variables with repeated measurements, the Cochran-Mantel-Haenszel test can be used.

Suppose this is your data, where each row represents one case:

  1. data <- read.table(header=TRUE, text='
  2. condition result
  3. control 0
  4. control 0
  5. control 0
  6. control 0
  7. treatment 1
  8. control 0
  9. control 0
  10. treatment 0
  11. treatment 1
  12. control 1
  13. treatment 1
  14. treatment 1
  15. treatment 1
  16. treatment 1
  17. treatment 0
  18. control 0
  19. control 1
  20. control 0
  21. control 1
  22. treatment 0
  23. treatment 1
  24. treatment 0
  25. treatment 0
  26. control 0
  27. treatment 1
  28. control 0
  29. control 0
  30. treatment 1
  31. treatment 0
  32. treatment 1
  33. ')

Instead of a data frame of cases, your data might be stored as a data frame of counts, or as a contingency table. For the analyses here, your data must be in a contingency table. See this page for more information.

Tests of goodness-of-fit (expected frequency)

Chi-square test

To test the hypothesis that the two values in the result column (ignoring condition) are equally likely (50%-50%) in the population:

  1. # Create contingency table for result, which contains values 0 and 1
  2. # The column names are "0" and "1" (they're not actually values in the table)
  3. ct <- table(data$result)
  4. ct
  5. #>
  6. #> 0 1
  7. #> 17 13
  8. # An alternative is to manually create the table
  9. #ct <- matrix(c(17,13), ncol=2)
  10. #colnames(ct1) <- c("0", "1")
  11. # Perform Chi-square test
  12. chisq.test(ct)
  13. #>
  14. #> Chi-squared test for given probabilities
  15. #>
  16. #> data: ct
  17. #> X-squared = 0.53333, df = 1, p-value = 0.4652

To test the sample with different expected frequencies (in this case 0.75 and 0.25):

  1. # Probability table - must sum to 1
  2. pt <- c(.75, .25)
  3. chisq.test(ct, p=pt)
  4. #>
  5. #> Chi-squared test for given probabilities
  6. #>
  7. #> data: ct
  8. #> X-squared = 5.3778, df = 1, p-value = 0.02039

If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str(), and pull out the parts you want. For example:

  1. chi_res <- chisq.test(ct, p=pt)
  2. # View all the parts that can be extracted
  3. str(chi_res)
  4. #> List of 9
  5. #> $ statistic: Named num 5.38
  6. #> ..- attr(*, "names")= chr "X-squared"
  7. #> $ parameter: Named num 1
  8. #> ..- attr(*, "names")= chr "df"
  9. #> $ p.value : num 0.0204
  10. #> $ method : chr "Chi-squared test for given probabilities"
  11. #> $ data.name: chr "ct"
  12. #> $ observed : 'table' int [1:2(1d)] 17 13
  13. #> ..- attr(*, "dimnames")=List of 1
  14. #> .. ..$ : chr [1:2] "0" "1"
  15. #> $ expected : Named num [1:2] 22.5 7.5
  16. #> ..- attr(*, "names")= chr [1:2] "0" "1"
  17. #> $ residuals: table [1:2(1d)] -1.16 2.01
  18. #> ..- attr(*, "dimnames")=List of 1
  19. #> .. ..$ : chr [1:2] "0" "1"
  20. #> $ stdres : table [1:2(1d)] -2.32 2.32
  21. #> ..- attr(*, "dimnames")=List of 1
  22. #> .. ..$ : chr [1:2] "0" "1"
  23. #> - attr(*, "class")= chr "htest"
  24. # Get the Chi-squared value
  25. chi_res$statistic
  26. #> X-squared
  27. #> 5.377778
  28. # Get the p-value
  29. chi_res$p.value
  30. #> [1] 0.02039484

Exact binomial test

The exact binomial test is used only with data where there is one variable with two possible values.

  1. ct <- table(data$result)
  2. ct
  3. #>
  4. #> 0 1
  5. #> 17 13
  6. binom.test(ct, p=0.5)
  7. #>
  8. #> Exact binomial test
  9. #>
  10. #> data: ct
  11. #> number of successes = 17, number of trials = 30, p-value = 0.5847
  12. #> alternative hypothesis: true probability of success is not equal to 0.5
  13. #> 95 percent confidence interval:
  14. #> 0.3742735 0.7453925
  15. #> sample estimates:
  16. #> probability of success
  17. #> 0.5666667
  18. # Using expected probability of 75% -- notice that 1 is in the second column of
  19. # the table so need to use p=.25
  20. binom.test(ct, p=0.25)
  21. #>
  22. #> Exact binomial test
  23. #>
  24. #> data: ct
  25. #> number of successes = 17, number of trials = 30, p-value = 0.0002157
  26. #> alternative hypothesis: true probability of success is not equal to 0.25
  27. #> 95 percent confidence interval:
  28. #> 0.3742735 0.7453925
  29. #> sample estimates:
  30. #> probability of success
  31. #> 0.5666667

If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str(), and pull out the parts you want. For example:

  1. bin_res <- binom.test(ct, p=0.25)
  2. # View all the parts that can be extracted
  3. str(bin_res)
  4. #> List of 9
  5. #> $ statistic : Named num 17
  6. #> ..- attr(*, "names")= chr "number of successes"
  7. #> $ parameter : Named num 30
  8. #> ..- attr(*, "names")= chr "number of trials"
  9. #> $ p.value : Named num 0.000216
  10. #> ..- attr(*, "names")= chr "0"
  11. #> $ conf.int : atomic [1:2] 0.374 0.745
  12. #> ..- attr(*, "conf.level")= num 0.95
  13. #> $ estimate : Named num 0.567
  14. #> ..- attr(*, "names")= chr "probability of success"
  15. #> $ null.value : Named num 0.25
  16. #> ..- attr(*, "names")= chr "probability of success"
  17. #> $ alternative: chr "two.sided"
  18. #> $ method : chr "Exact binomial test"
  19. #> $ data.name : chr "ct"
  20. #> - attr(*, "class")= chr "htest"
  21. # Get the p-value
  22. bin_res$p.value
  23. #> 0
  24. #> 0.0002156938
  25. # Get the 95% confidence interval
  26. bin_res$conf.int
  27. #> [1] 0.3742735 0.7453925
  28. #> attr(,"conf.level")
  29. #> [1] 0.95

Tests of independence (comparing groups)

Chi-square test

To test whether the control and treatment conditions result in different frequencies, use a 2D contingency table.

  1. ct <- table(data$condition, data$result)
  2. ct
  3. #>
  4. #> 0 1
  5. #> control 11 3
  6. #> treatment 6 10
  7. chisq.test(ct)
  8. #>
  9. #> Pearson's Chi-squared test with Yates' continuity correction
  10. #>
  11. #> data: ct
  12. #> X-squared = 3.593, df = 1, p-value = 0.05802
  13. chisq.test(ct, correct=FALSE)
  14. #>
  15. #> Pearson's Chi-squared test
  16. #>
  17. #> data: ct
  18. #> X-squared = 5.1293, df = 1, p-value = 0.02353

For 2x2 tables, it uses Yates’s continuity correction by default. This test is more conservative for small sample sizes.The flag correct=FALSE, tells it to use Pearson’s chi-square test without the correction.

Fisher’s exact test

For small sample sizes, Fisher’s exact test may be more appropriate. It is typically used for 2x2 tables with relatively small sample sizes because it is computationally intensive for more complicated (e.g., 2x3) tables, and those with larger sample sizes. However, the implementation in R seems to handle larger cases without trouble.

  1. ct <- table(data$condition, data$result)
  2. ct
  3. #>
  4. #> 0 1
  5. #> control 11 3
  6. #> treatment 6 10
  7. fisher.test(ct)
  8. #>
  9. #> Fisher's Exact Test for Count Data
  10. #>
  11. #> data: ct
  12. #> p-value = 0.03293
  13. #> alternative hypothesis: true odds ratio is not equal to 1
  14. #> 95 percent confidence interval:
  15. #> 0.966861 45.555016
  16. #> sample estimates:
  17. #> odds ratio
  18. #> 5.714369

Cochran-Mantel-Haenszel test

The Cochran-Mantel-Haenszel test (or Mantel-Haenszel test) is used for testing the independence of two dichotomous variables with repeated measurements. It is most commonly used with 2x2xK tables, where K is the number of measurement conditions. For example, you may want to know whether a treatment (C vs. D) affects the likelihood of recovery (yes or no). Suppose, though, that the treatments were administered at three different times of day, morning, afternoon, and night – and that you want your test to control for this. The CMH test would then operate on a 2x2x3 contingency table, where the third variable is the one you wish to control for.

The implementation of the CMH test in R can handle dimensions greater than 2x2xK. For example, you could use it for a 3x3xK contingency table.

In the following example (borrowed from McDonald’s Handbook of Biological Statistics), there are three variables: Location, Allele, and Habitat. The question is whether Allele (94 or non-94) and Habitat (marine or estuarine) are independent, when location is controlled for.

  1. fish <- read.table(header=TRUE, text='
  2. Location Allele Habitat Count
  3. tillamook 94 marine 56
  4. tillamook 94 estuarine 69
  5. tillamook non-94 marine 40
  6. tillamook non-94 estuarine 77
  7. yaquina 94 marine 61
  8. yaquina 94 estuarine 257
  9. yaquina non-94 marine 57
  10. yaquina non-94 estuarine 301
  11. alsea 94 marine 73
  12. alsea 94 estuarine 65
  13. alsea non-94 marine 71
  14. alsea non-94 estuarine 79
  15. umpqua 94 marine 71
  16. umpqua 94 estuarine 48
  17. umpqua non-94 marine 55
  18. umpqua non-94 estuarine 48
  19. ')

Note that the data above is entered as a data frame of counts, instead of a data frame of cases as in previous examples. Instead of using table() to convert it to a contingency table, use xtabs() instead. For more information, see this page.

  1. # Make a 3D contingency table, where the last variable, Location, is the one to
  2. # control for. If you use table() for case data, the last variable is also the
  3. # one to control for.
  4. ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
  5. ct
  6. #> , , Location = alsea
  7. #>
  8. #> Habitat
  9. #> Allele estuarine marine
  10. #> 94 65 73
  11. #> non-94 79 71
  12. #>
  13. #> , , Location = tillamook
  14. #>
  15. #> Habitat
  16. #> Allele estuarine marine
  17. #> 94 69 56
  18. #> non-94 77 40
  19. #>
  20. #> , , Location = umpqua
  21. #>
  22. #> Habitat
  23. #> Allele estuarine marine
  24. #> 94 48 71
  25. #> non-94 48 55
  26. #>
  27. #> , , Location = yaquina
  28. #>
  29. #> Habitat
  30. #> Allele estuarine marine
  31. #> 94 257 61
  32. #> non-94 301 57
  33. # This prints ct in a "flat" format
  34. ftable(ct)
  35. #> Location alsea tillamook umpqua yaquina
  36. #> Allele Habitat
  37. #> 94 estuarine 65 69 48 257
  38. #> marine 73 56 71 61
  39. #> non-94 estuarine 79 77 48 301
  40. #> marine 71 40 55 57
  41. # Print with different arrangement of variables
  42. ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat")
  43. #> Habitat estuarine marine
  44. #> Location Allele
  45. #> alsea 94 65 73
  46. #> non-94 79 71
  47. #> tillamook 94 69 56
  48. #> non-94 77 40
  49. #> umpqua 94 48 71
  50. #> non-94 48 55
  51. #> yaquina 94 257 61
  52. #> non-94 301 57
  53. mantelhaen.test(ct)
  54. #>
  55. #> Mantel-Haenszel chi-squared test with continuity correction
  56. #>
  57. #> data: ct
  58. #> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
  59. #> alternative hypothesis: true common odds ratio is not equal to 1
  60. #> 95 percent confidence interval:
  61. #> 0.6005522 0.9593077
  62. #> sample estimates:
  63. #> common odds ratio
  64. #> 0.759022

According to this test, there is a relationship between Allele and Habitat, controlling for Location, p=.025.

Note that the first two dimensions of the contingency table are treated the same (so their order can be swapped without affecting the test result), the highest-order dimension in the contingency table is different. This is illustrated below.

  1. # The following two create different contingency tables, but have the same result
  2. # with the CMH test
  3. ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish)
  4. ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
  5. mantelhaen.test(ct.1)
  6. #>
  7. #> Mantel-Haenszel chi-squared test with continuity correction
  8. #>
  9. #> data: ct.1
  10. #> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
  11. #> alternative hypothesis: true common odds ratio is not equal to 1
  12. #> 95 percent confidence interval:
  13. #> 0.6005522 0.9593077
  14. #> sample estimates:
  15. #> common odds ratio
  16. #> 0.759022
  17. mantelhaen.test(ct.2)
  18. #>
  19. #> Mantel-Haenszel chi-squared test with continuity correction
  20. #>
  21. #> data: ct.2
  22. #> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
  23. #> alternative hypothesis: true common odds ratio is not equal to 1
  24. #> 95 percent confidence interval:
  25. #> 0.6005522 0.9593077
  26. #> sample estimates:
  27. #> common odds ratio
  28. #> 0.759022
  29. # With Allele last, we get a different result
  30. ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish)
  31. ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish)
  32. mantelhaen.test(ct.3)
  33. #>
  34. #> Cochran-Mantel-Haenszel test
  35. #>
  36. #> data: ct.3
  37. #> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
  38. mantelhaen.test(ct.4)
  39. #>
  40. #> Cochran-Mantel-Haenszel test
  41. #>
  42. #> data: ct.4
  43. #> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
  44. # With Habitat last, we get a different result
  45. ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish)
  46. ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish)
  47. mantelhaen.test(ct.5)
  48. #>
  49. #> Cochran-Mantel-Haenszel test
  50. #>
  51. #> data: ct.5
  52. #> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
  53. mantelhaen.test(ct.6)
  54. #>
  55. #> Cochran-Mantel-Haenszel test
  56. #>
  57. #> data: ct.6
  58. #> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689

McNemar’s test

McNemar’s test is conceptually like a within-subjects test for frequency data. For example, suppose you want to test whether a treatment increases the probability that a person will respond “yes” to a question, and that you get just one pre-treatment and one post-treatment response per person. A standard chi-square test would be inappropriate, because it assumes that the groups are independent. Instead, McNemar’s test can be used. This test can only be used when there are two measurements of a dichotomous variable. The 2x2 contingency table used for McNemar’s test bears a superficial resemblance to those used for “normal” chi-square tests, but it is different in structure.

Suppose this is your data. Each subject has a pre-treatment and post-treatment response.

  1. data <- read.table(header=TRUE, text='
  2. subject time result
  3. 1 pre 0
  4. 1 post 1
  5. 2 pre 1
  6. 2 post 1
  7. 3 pre 0
  8. 3 post 1
  9. 4 pre 1
  10. 4 post 0
  11. 5 pre 1
  12. 5 post 1
  13. 6 pre 0
  14. 6 post 1
  15. 7 pre 0
  16. 7 post 1
  17. 8 pre 0
  18. 8 post 1
  19. 9 pre 0
  20. 9 post 1
  21. 10 pre 1
  22. 10 post 1
  23. 11 pre 0
  24. 11 post 0
  25. 12 pre 1
  26. 12 post 1
  27. 13 pre 0
  28. 13 post 1
  29. 14 pre 0
  30. 14 post 0
  31. 15 pre 0
  32. 15 post 1
  33. ')

If your data is not already in wide format, it must be converted (see this page for more information):

  1. library(tidyr)
  2. data_wide <- spread(data, time, result)
  3. data_wide
  4. #> subject post pre
  5. #> 1 1 1 0
  6. #> 2 2 1 1
  7. #> 3 3 1 0
  8. #> 4 4 0 1
  9. #> 5 5 1 1
  10. #> 6 6 1 0
  11. #> 7 7 1 0
  12. #> 8 8 1 0
  13. #> 9 9 1 0
  14. #> 10 10 1 1
  15. #> 11 11 0 0
  16. #> 12 12 1 1
  17. #> 13 13 1 0
  18. #> 14 14 0 0
  19. #> 15 15 1 0

Next, generate the contingency table from just the pre and post columns from the data frame:

  1. ct <- table( data_wide[,c("pre","post")] )
  2. ct
  3. #> post
  4. #> pre 0 1
  5. #> 0 2 8
  6. #> 1 1 4
  7. # The contingency table above puts each subject into one of four cells, depending
  8. # on their pre- and post-treatment response. Note that it it is different from
  9. # the contingency table used for a "normal" chi-square test, shown below. The table
  10. # below does not account for repeated measurements, and so it is not useful for
  11. # the purposes here.
  12. # table(data[,c("time","result")])
  13. # result
  14. # time 0 1
  15. # post 3 12
  16. # pre 10 5

After creating the appropriate contingency table, run the test:

  1. mcnemar.test(ct)
  2. #>
  3. #> McNemar's Chi-squared test with continuity correction
  4. #>
  5. #> data: ct
  6. #> McNemar's chi-squared = 4, df = 1, p-value = 0.0455

For small sample sizes, it uses a continuity correction. Instead of using this correction, you can use an exact version of McNemar’s test, which is more accurate. It is available in the package exact2x2.

  1. library(exact2x2)
  2. #> Loading required package: exactci
  3. #> Loading required package: ssanv
  4. mcnemar.exact(ct)
  5. #>
  6. #> Exact McNemar test (with central confidence intervals)
  7. #>
  8. #> data: ct
  9. #> b = 8, c = 1, p-value = 0.03906
  10. #> alternative hypothesis: true odds ratio is not equal to 1
  11. #> 95 percent confidence interval:
  12. #> 1.072554 354.981246
  13. #> sample estimates:
  14. #> odds ratio
  15. #> 8