Frequency tests
Problem
You have categorical data and want test whether the frequency distribution of values differs from an expected frequency, or if the distribution differs between groups.
Solution
There are two general questions that frequency tests address:
- Does the frequency distribution differ from an expected, or theoretical, distribution (e.g., expect 50% yes, 50% no)? (Goodness-of-fit test)
- Does the frequency distribution differ between two or more groups? (Independence test) Of the statistical tests commonly used to address these questions, there are exact tests and approximate tests.
Expected distribution | Comparing groups | |
---|---|---|
Exact | Exact binomial | Fisher’s exact |
Approximate | Chi-square goodness of fit | Chi-square test of independence |
Note: The exact binomial test can only be used on one variable that has two levels. Fisher’s exact test can only be used with two-dimensional contingency tables (for example, it can be used when there is one independent variable and one dependent variable, but not when there are 2 IVs and 1 DV).
To test for paired or within-subjects effects, McNemar’s test can be used. To use it, there must be one IV with two levels, and one DV with two levels.
To test for the independence of two variables with repeated measurements, the Cochran-Mantel-Haenszel test can be used.
Suppose this is your data, where each row represents one case:
data <- read.table(header=TRUE, text='
condition result
control 0
control 0
control 0
control 0
treatment 1
control 0
control 0
treatment 0
treatment 1
control 1
treatment 1
treatment 1
treatment 1
treatment 1
treatment 0
control 0
control 1
control 0
control 1
treatment 0
treatment 1
treatment 0
treatment 0
control 0
treatment 1
control 0
control 0
treatment 1
treatment 0
treatment 1
')
Instead of a data frame of cases, your data might be stored as a data frame of counts, or as a contingency table. For the analyses here, your data must be in a contingency table. See this page for more information.
Tests of goodness-of-fit (expected frequency)
Chi-square test
To test the hypothesis that the two values in the result column (ignoring condition) are equally likely (50%-50%) in the population:
# Create contingency table for result, which contains values 0 and 1
# The column names are "0" and "1" (they're not actually values in the table)
ct <- table(data$result)
ct
#>
#> 0 1
#> 17 13
# An alternative is to manually create the table
#ct <- matrix(c(17,13), ncol=2)
#colnames(ct1) <- c("0", "1")
# Perform Chi-square test
chisq.test(ct)
#>
#> Chi-squared test for given probabilities
#>
#> data: ct
#> X-squared = 0.53333, df = 1, p-value = 0.4652
To test the sample with different expected frequencies (in this case 0.75 and 0.25):
# Probability table - must sum to 1
pt <- c(.75, .25)
chisq.test(ct, p=pt)
#>
#> Chi-squared test for given probabilities
#>
#> data: ct
#> X-squared = 5.3778, df = 1, p-value = 0.02039
If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str()
, and pull out the parts you want. For example:
chi_res <- chisq.test(ct, p=pt)
# View all the parts that can be extracted
str(chi_res)
#> List of 9
#> $ statistic: Named num 5.38
#> ..- attr(*, "names")= chr "X-squared"
#> $ parameter: Named num 1
#> ..- attr(*, "names")= chr "df"
#> $ p.value : num 0.0204
#> $ method : chr "Chi-squared test for given probabilities"
#> $ data.name: chr "ct"
#> $ observed : 'table' int [1:2(1d)] 17 13
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "0" "1"
#> $ expected : Named num [1:2] 22.5 7.5
#> ..- attr(*, "names")= chr [1:2] "0" "1"
#> $ residuals: table [1:2(1d)] -1.16 2.01
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "0" "1"
#> $ stdres : table [1:2(1d)] -2.32 2.32
#> ..- attr(*, "dimnames")=List of 1
#> .. ..$ : chr [1:2] "0" "1"
#> - attr(*, "class")= chr "htest"
# Get the Chi-squared value
chi_res$statistic
#> X-squared
#> 5.377778
# Get the p-value
chi_res$p.value
#> [1] 0.02039484
Exact binomial test
The exact binomial test is used only with data where there is one variable with two possible values.
ct <- table(data$result)
ct
#>
#> 0 1
#> 17 13
binom.test(ct, p=0.5)
#>
#> Exact binomial test
#>
#> data: ct
#> number of successes = 17, number of trials = 30, p-value = 0.5847
#> alternative hypothesis: true probability of success is not equal to 0.5
#> 95 percent confidence interval:
#> 0.3742735 0.7453925
#> sample estimates:
#> probability of success
#> 0.5666667
# Using expected probability of 75% -- notice that 1 is in the second column of
# the table so need to use p=.25
binom.test(ct, p=0.25)
#>
#> Exact binomial test
#>
#> data: ct
#> number of successes = 17, number of trials = 30, p-value = 0.0002157
#> alternative hypothesis: true probability of success is not equal to 0.25
#> 95 percent confidence interval:
#> 0.3742735 0.7453925
#> sample estimates:
#> probability of success
#> 0.5666667
If you want to extract information out of the test result, you can save the result into a variable, examine the variable with str()
, and pull out the parts you want. For example:
bin_res <- binom.test(ct, p=0.25)
# View all the parts that can be extracted
str(bin_res)
#> List of 9
#> $ statistic : Named num 17
#> ..- attr(*, "names")= chr "number of successes"
#> $ parameter : Named num 30
#> ..- attr(*, "names")= chr "number of trials"
#> $ p.value : Named num 0.000216
#> ..- attr(*, "names")= chr "0"
#> $ conf.int : atomic [1:2] 0.374 0.745
#> ..- attr(*, "conf.level")= num 0.95
#> $ estimate : Named num 0.567
#> ..- attr(*, "names")= chr "probability of success"
#> $ null.value : Named num 0.25
#> ..- attr(*, "names")= chr "probability of success"
#> $ alternative: chr "two.sided"
#> $ method : chr "Exact binomial test"
#> $ data.name : chr "ct"
#> - attr(*, "class")= chr "htest"
# Get the p-value
bin_res$p.value
#> 0
#> 0.0002156938
# Get the 95% confidence interval
bin_res$conf.int
#> [1] 0.3742735 0.7453925
#> attr(,"conf.level")
#> [1] 0.95
Tests of independence (comparing groups)
Chi-square test
To test whether the control and treatment conditions result in different frequencies, use a 2D contingency table.
ct <- table(data$condition, data$result)
ct
#>
#> 0 1
#> control 11 3
#> treatment 6 10
chisq.test(ct)
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: ct
#> X-squared = 3.593, df = 1, p-value = 0.05802
chisq.test(ct, correct=FALSE)
#>
#> Pearson's Chi-squared test
#>
#> data: ct
#> X-squared = 5.1293, df = 1, p-value = 0.02353
For 2x2 tables, it uses Yates’s continuity correction by default. This test is more conservative for small sample sizes.The flag correct=FALSE
, tells it to use Pearson’s chi-square test without the correction.
Fisher’s exact test
For small sample sizes, Fisher’s exact test may be more appropriate. It is typically used for 2x2 tables with relatively small sample sizes because it is computationally intensive for more complicated (e.g., 2x3) tables, and those with larger sample sizes. However, the implementation in R seems to handle larger cases without trouble.
ct <- table(data$condition, data$result)
ct
#>
#> 0 1
#> control 11 3
#> treatment 6 10
fisher.test(ct)
#>
#> Fisher's Exact Test for Count Data
#>
#> data: ct
#> p-value = 0.03293
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.966861 45.555016
#> sample estimates:
#> odds ratio
#> 5.714369
Cochran-Mantel-Haenszel test
The Cochran-Mantel-Haenszel test (or Mantel-Haenszel test) is used for testing the independence of two dichotomous variables with repeated measurements. It is most commonly used with 2x2xK tables, where K is the number of measurement conditions. For example, you may want to know whether a treatment (C vs. D) affects the likelihood of recovery (yes or no). Suppose, though, that the treatments were administered at three different times of day, morning, afternoon, and night – and that you want your test to control for this. The CMH test would then operate on a 2x2x3 contingency table, where the third variable is the one you wish to control for.
The implementation of the CMH test in R can handle dimensions greater than 2x2xK. For example, you could use it for a 3x3xK contingency table.
In the following example (borrowed from McDonald’s Handbook of Biological Statistics), there are three variables: Location, Allele, and Habitat. The question is whether Allele (94 or non-94) and Habitat (marine or estuarine) are independent, when location is controlled for.
fish <- read.table(header=TRUE, text='
Location Allele Habitat Count
tillamook 94 marine 56
tillamook 94 estuarine 69
tillamook non-94 marine 40
tillamook non-94 estuarine 77
yaquina 94 marine 61
yaquina 94 estuarine 257
yaquina non-94 marine 57
yaquina non-94 estuarine 301
alsea 94 marine 73
alsea 94 estuarine 65
alsea non-94 marine 71
alsea non-94 estuarine 79
umpqua 94 marine 71
umpqua 94 estuarine 48
umpqua non-94 marine 55
umpqua non-94 estuarine 48
')
Note that the data above is entered as a data frame of counts, instead of a data frame of cases as in previous examples. Instead of using table()
to convert it to a contingency table, use xtabs()
instead. For more information, see this page.
# Make a 3D contingency table, where the last variable, Location, is the one to
# control for. If you use table() for case data, the last variable is also the
# one to control for.
ct <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
ct
#> , , Location = alsea
#>
#> Habitat
#> Allele estuarine marine
#> 94 65 73
#> non-94 79 71
#>
#> , , Location = tillamook
#>
#> Habitat
#> Allele estuarine marine
#> 94 69 56
#> non-94 77 40
#>
#> , , Location = umpqua
#>
#> Habitat
#> Allele estuarine marine
#> 94 48 71
#> non-94 48 55
#>
#> , , Location = yaquina
#>
#> Habitat
#> Allele estuarine marine
#> 94 257 61
#> non-94 301 57
# This prints ct in a "flat" format
ftable(ct)
#> Location alsea tillamook umpqua yaquina
#> Allele Habitat
#> 94 estuarine 65 69 48 257
#> marine 73 56 71 61
#> non-94 estuarine 79 77 48 301
#> marine 71 40 55 57
# Print with different arrangement of variables
ftable(ct, row.vars=c("Location","Allele"), col.vars="Habitat")
#> Habitat estuarine marine
#> Location Allele
#> alsea 94 65 73
#> non-94 79 71
#> tillamook 94 69 56
#> non-94 77 40
#> umpqua 94 48 71
#> non-94 48 55
#> yaquina 94 257 61
#> non-94 301 57
mantelhaen.test(ct)
#>
#> Mantel-Haenszel chi-squared test with continuity correction
#>
#> data: ct
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#> 0.759022
According to this test, there is a relationship between Allele and Habitat, controlling for Location, p=.025.
Note that the first two dimensions of the contingency table are treated the same (so their order can be swapped without affecting the test result), the highest-order dimension in the contingency table is different. This is illustrated below.
# The following two create different contingency tables, but have the same result
# with the CMH test
ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data=fish)
ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data=fish)
mantelhaen.test(ct.1)
#>
#> Mantel-Haenszel chi-squared test with continuity correction
#>
#> data: ct.1
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#> 0.759022
mantelhaen.test(ct.2)
#>
#> Mantel-Haenszel chi-squared test with continuity correction
#>
#> data: ct.2
#> Mantel-Haenszel X-squared = 5.0497, df = 1, p-value = 0.02463
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 0.6005522 0.9593077
#> sample estimates:
#> common odds ratio
#> 0.759022
# With Allele last, we get a different result
ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data=fish)
ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data=fish)
mantelhaen.test(ct.3)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.3
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
mantelhaen.test(ct.4)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.4
#> Cochran-Mantel-Haenszel M^2 = 168.47, df = 3, p-value < 2.2e-16
# With Habitat last, we get a different result
ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data=fish)
ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data=fish)
mantelhaen.test(ct.5)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.5
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
mantelhaen.test(ct.6)
#>
#> Cochran-Mantel-Haenszel test
#>
#> data: ct.6
#> Cochran-Mantel-Haenszel M^2 = 2.0168, df = 3, p-value = 0.5689
McNemar’s test
McNemar’s test is conceptually like a within-subjects test for frequency data. For example, suppose you want to test whether a treatment increases the probability that a person will respond “yes” to a question, and that you get just one pre-treatment and one post-treatment response per person. A standard chi-square test would be inappropriate, because it assumes that the groups are independent. Instead, McNemar’s test can be used. This test can only be used when there are two measurements of a dichotomous variable. The 2x2 contingency table used for McNemar’s test bears a superficial resemblance to those used for “normal” chi-square tests, but it is different in structure.
Suppose this is your data. Each subject has a pre-treatment and post-treatment response.
data <- read.table(header=TRUE, text='
subject time result
1 pre 0
1 post 1
2 pre 1
2 post 1
3 pre 0
3 post 1
4 pre 1
4 post 0
5 pre 1
5 post 1
6 pre 0
6 post 1
7 pre 0
7 post 1
8 pre 0
8 post 1
9 pre 0
9 post 1
10 pre 1
10 post 1
11 pre 0
11 post 0
12 pre 1
12 post 1
13 pre 0
13 post 1
14 pre 0
14 post 0
15 pre 0
15 post 1
')
If your data is not already in wide format, it must be converted (see this page for more information):
library(tidyr)
data_wide <- spread(data, time, result)
data_wide
#> subject post pre
#> 1 1 1 0
#> 2 2 1 1
#> 3 3 1 0
#> 4 4 0 1
#> 5 5 1 1
#> 6 6 1 0
#> 7 7 1 0
#> 8 8 1 0
#> 9 9 1 0
#> 10 10 1 1
#> 11 11 0 0
#> 12 12 1 1
#> 13 13 1 0
#> 14 14 0 0
#> 15 15 1 0
Next, generate the contingency table from just the pre and post columns from the data frame:
ct <- table( data_wide[,c("pre","post")] )
ct
#> post
#> pre 0 1
#> 0 2 8
#> 1 1 4
# The contingency table above puts each subject into one of four cells, depending
# on their pre- and post-treatment response. Note that it it is different from
# the contingency table used for a "normal" chi-square test, shown below. The table
# below does not account for repeated measurements, and so it is not useful for
# the purposes here.
# table(data[,c("time","result")])
# result
# time 0 1
# post 3 12
# pre 10 5
After creating the appropriate contingency table, run the test:
mcnemar.test(ct)
#>
#> McNemar's Chi-squared test with continuity correction
#>
#> data: ct
#> McNemar's chi-squared = 4, df = 1, p-value = 0.0455
For small sample sizes, it uses a continuity correction. Instead of using this correction, you can use an exact version of McNemar’s test, which is more accurate. It is available in the package exact2x2
.
library(exact2x2)
#> Loading required package: exactci
#> Loading required package: ssanv
mcnemar.exact(ct)
#>
#> Exact McNemar test (with central confidence intervals)
#>
#> data: ct
#> b = 8, c = 1, p-value = 0.03906
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#> 1.072554 354.981246
#> sample estimates:
#> odds ratio
#> 8