第 7 章统计分析 - 7.3 频率检验 - 《Cookbook for R 中文版》

7.3 频率检验
- 7.3.1 问题
- 7.3.2 方案

7.3 频率检验

7.3.1 问题

你有分类数据然后想要检验是否这些数据值的频数分布是否与预期不符，或者是否组间的频数分布有（显著）差异。

7.3.2 方案

频数检验通常解决两类问题：

频数分布与预期或者理论的分布（比如 50% 的 yes，50% 的 no ）符合吗？（拟合优度检验）
两组或多组之间的频率分布有差异吗？（独立检验）通常用于解决这样问题的统计检验方法，分为精确检验与近似检验两种。

	期望分布	比较组别
精确	精确二项检验	Fisher精确检验
近似	卡方拟合优度	独立卡方检验

注意：精确二项检验仅能用于有两个水平的单变量。Fisher 精确检验仅能用于二维列联表（比如，当存在一个独立变量和一个非独立变量时它可以使用；但不能用于两个独立变量和一个非独立变量的情况）。

想要检验配对或被试内效应，我们可以使用 McNemar 检验。使用该检验必须满足存在两个水平的独立变量和两个水平的非独立变量。

想要检验有重复测量的两个变量独立性，我们可以使用 Cochran-Mantel-Haenszel 检验。

假设你有下面的数据，其中每一行代表一个记录：

data <- read.table(header = TRUE, text = "
 condition result
   control      0
   control      0
   control      0
   control      0
 treatment      1
   control      0
   control      0
 treatment      0
 treatment      1
   control      1
 treatment      1
 treatment      1
 treatment      1
 treatment      1
 treatment      0
   control      0
   control      1
   control      0
   control      1
 treatment      0
 treatment      1
 treatment      0
 treatment      0
   control      0
 treatment      1
   control      0
   control      0
 treatment      1
 treatment      0
 treatment      1
")

相比于以记录的数据框存储，你的数据可能是计数的数据框，或者是一个列联表。

7.3.2.1 拟合优度检验（期望频率）

7.3.2.1.1 卡方检验

想要检验假设：结果列 result（忽略条件 condition ）中的两个值在总体中几乎相等（50% - 50%）。

# 　为result列创建列联表，包含 0 和 1 两个值 　注意，'0'
# 和 '1' 是列名而不是实际的值
ct <- table(data$result)
ct
#> 
#>  0  1 
#> 17 13
# 也可以手动创建表格
ct <- matrix(c(17, 13), ncol = 2)
colnames(ct) <- c("0", "1")
# 执行卡方检验
chisq.test(ct)
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  ct
#> X-squared = 0.53, df = 1, p-value = 0.5

想要检验有不同期望频率的样本（比如下面一个 0.75，一个 0.25 ）：

# 概率表 —— 和必须为 1
pt <- c(0.75, 0.25)
chisq.test(ct, p = pt)
#> 
#>  Chi-squared test for given probabilities
#> 
#> data:  ct
#> X-squared = 5.4, df = 1, p-value = 0.02

如果你想要从检验结果中提取信息，可以将结果保存进一个变量，然后用 str() 函数查看变量信息，接着把你想要的部分取出来。例如：

chi_res <- chisq.test(ct, p = pt)
# 查看所有组分
str(chi_res)
#> List of 9
#>  $ statistic: Named num 5.38
#>   ..- attr(*, "names")= chr "X-squared"
#>  $ parameter: Named num 1
#>   ..- attr(*, "names")= chr "df"
#>  $ p.value  : num 0.0204
#>  $ method   : chr "Chi-squared test for given probabilities"
#>  $ data.name: chr "ct"
#>  $ observed : num [1:2] 17 13
#>  $ expected : num [1:2] 22.5 7.5
#>  $ residuals: num [1:2] -1.16 2.01
#>  $ stdres   : num [1:2] -2.32 2.32
#>  - attr(*, "class")= chr "htest"
# 获取卡方值
chi_res$statistic
#> X-squared 
#>     5.378
# 获取p值
chi_res$p.value
#> [1] 0.02039

7.3.2.1.2 精确二项检验

精确二项检验仅能用于存在两个值的单变量数据。

ct <- table(data$result)
ct
#> 
#>  0  1 
#> 17 13
binom.test(ct, p = 0.5)
#> 
#>  Exact binomial test
#> 
#> data:  ct
#> number of successes = 17, number of trials = 30,
#> p-value = 0.6
#> alternative hypothesis: true probability of success is not equal to 0.5
#> 95 percent confidence interval:
#>  0.3743 0.7454
#> sample estimates:
#> probability of success 
#>                 0.5667
# 使用 75% 的期望概率——注意 1 在第二列，所以只需要令 p
# = 0.25
binom.test(ct, p = 0.25)
#> 
#>  Exact binomial test
#> 
#> data:  ct
#> number of successes = 17, number of trials = 30,
#> p-value = 2e-04
#> alternative hypothesis: true probability of success is not equal to 0.25
#> 95 percent confidence interval:
#>  0.3743 0.7454
#> sample estimates:
#> probability of success 
#>                 0.5667

如果你想要从检验结果中提取信息，可以将结果保存进一个变量，然后用 str() 函数查看变量信息，接着把你想要的部分取出来。例如：

bin_res <- binom.test(ct, p = 0.25)
# 字符串格式化后查看信息
str(bin_res)
#> List of 9
#>  $ statistic  : Named num 17
#>   ..- attr(*, "names")= chr "number of successes"
#>  $ parameter  : Named num 30
#>   ..- attr(*, "names")= chr "number of trials"
#>  $ p.value    : Named num 0.000216
#>   ..- attr(*, "names")= chr "0"
#>  $ conf.int   : num [1:2] 0.374 0.745
#>   ..- attr(*, "conf.level")= num 0.95
#>  $ estimate   : Named num 0.567
#>   ..- attr(*, "names")= chr "probability of success"
#>  $ null.value : Named num 0.25
#>   ..- attr(*, "names")= chr "probability of success"
#>  $ alternative: chr "two.sided"
#>  $ method     : chr "Exact binomial test"
#>  $ data.name  : chr "ct"
#>  - attr(*, "class")= chr "htest"
# 获取 p 值
bin_res$p.value
#>         0 
#> 0.0002157
# 获取 95% 置信区间
bin_res$conf.int
#> [1] 0.3743 0.7454
#> attr(,"conf.level")
#> [1] 0.95

7.3.2.2 独立检验（比较组间）

7.3.2.2.1 卡方检验

想要检验控制和处理组结果的频数差异，使用二维列联表。

ct <- table(data$condition, data$result)
ct
#>            
#>              0  1
#>   control   11  3
#>   treatment  6 10
chisq.test(ct)
#> 
#>  Pearson's Chi-squared test with Yates'
#>  continuity correction
#> 
#> data:  ct
#> X-squared = 3.6, df = 1, p-value = 0.06
chisq.test(ct, correct = FALSE)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  ct
#> X-squared = 5.1, df = 1, p-value = 0.02

7.3.2.2.2 Fisher 精确检验

对于小样本而言 Fisher 精确检验更为适合。小样本的 2x2 列表非常典型，样本更多、更复杂的列表计算强度非常大。当然，用R进行比较复杂的计算也是没有太大问题的。

ct <- table(data$condition, data$result)
ct
#>            
#>              0  1
#>   control   11  3
#>   treatment  6 10
fisher.test(ct)
#> 
#>  Fisher's Exact Test for Count Data
#> 
#> data:  ct
#> p-value = 0.03
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>   0.9669 45.5550
#> sample estimates:
#> odds ratio 
#>      5.714

7.3.2.2.3 Cochran-Mantel-Haenszel 检验

Cochran-Mantel-Haenszel 检验（或称为 Mantel-Haenszel 检验)）用于检验重复测量两离散变量的独立性。通常使用 2x2xK 列表表示，K是测量条件的次数。比如你想要指导是否一个处理（C vs. D）是否影响了恢复的概率（yes or no）。假设该处理一天监控测量三次——早上、中午和晚上，而你想要你的检验能够控制它。那么你可以使用 CMH 检验对 2x2x3 列联表进行操作，第三个变量是你想要控制的变量。

R 中的 CMH 检验可以处理比 2x2xK 维度更高的数据，例如你处理 3x3xK 列联表。

在接下来的例子里有三个变量：Location、Allele 和 Habitat。问题是——当控制 location 变量时，Allel（94 或非 94）和 Habitat（marine 或 estuarine）两个变量是否独立。

fish <- read.table(header = TRUE, text = "
  Location Allele   Habitat Count
 tillamook     94    marine    56
 tillamook     94 estuarine    69
 tillamook non-94    marine    40
 tillamook non-94 estuarine    77
   yaquina     94    marine    61
   yaquina     94 estuarine   257
   yaquina non-94    marine    57
   yaquina non-94 estuarine   301
   alsea     94    marine    73
   alsea     94 estuarine    65
   alsea non-94    marine    71
   alsea non-94 estuarine    79
  umpqua     94    marine    71
  umpqua     94 estuarine    48
  umpqua non-94    marine    55
  umpqua non-94 estuarine    48
")

注意上面的数据是计数的数据框，而不是像之前的例子是记录的数据框。这里我们使用 xtabs() 函数将它转换为列联表。

# 制造一个三维的列联表，最后一个变量时要控制的 Location
# 变量
ct <- xtabs(Count ~ Allele + Habitat + Location, data = fish)
ct
#> , , Location = alsea
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94            65     73
#>   non-94        79     71
#> 
#> , , Location = tillamook
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94            69     56
#>   non-94        77     40
#> 
#> , , Location = umpqua
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94            48     71
#>   non-94        48     55
#> 
#> , , Location = yaquina
#> 
#>         Habitat
#> Allele   estuarine marine
#>   94           257     61
#>   non-94       301     57
# 以扁平化显示
ftable(ct)
#>                  Location alsea tillamook umpqua yaquina
#> Allele Habitat                                          
#> 94     estuarine             65        69     48     257
#>        marine                73        56     71      61
#> non-94 estuarine             79        77     48     301
#>        marine                71        40     55      57
# 按指定方式进行变量输出
ftable(ct, row.vars = c("Location", "Allele"), col.vars = "Habitat")
#>                  Habitat estuarine marine
#> Location  Allele                         
#> alsea     94                    65     73
#>           non-94                79     71
#> tillamook 94                    69     56
#>           non-94                77     40
#> umpqua    94                    48     71
#>           non-94                48     55
#> yaquina   94                   257     61
#>           non-94               301     57
mantelhaen.test(ct)
#> 
#>  Mantel-Haenszel chi-squared test with
#>  continuity correction
#> 
#> data:  ct
#> Mantel-Haenszel X-squared = 5, df = 1, p-value =
#> 0.02
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6006 0.9593
#> sample estimates:
#> common odds ratio 
#>             0.759

根据检验结果，当控制 Location 变量时 Allele 与 Habitat 变量存在相关（p = 0.025）。

注意列联表的前两个维度处理是一致的，所以前后顺序变化都不会影响结果。而最后一个变量变化会导致结果的不同，下面是一个实例。

# 下面两个看似不同的列联表，实际检验结果相同
ct.1 <- xtabs(Count ~ Habitat + Allele + Location, data = fish)
ct.2 <- xtabs(Count ~ Allele + Habitat + Location, data = fish)
mantelhaen.test(ct.1)
#> 
#>  Mantel-Haenszel chi-squared test with
#>  continuity correction
#> 
#> data:  ct.1
#> Mantel-Haenszel X-squared = 5, df = 1, p-value =
#> 0.02
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6006 0.9593
#> sample estimates:
#> common odds ratio 
#>             0.759
mantelhaen.test(ct.2)
#> 
#>  Mantel-Haenszel chi-squared test with
#>  continuity correction
#> 
#> data:  ct.2
#> Mantel-Haenszel X-squared = 5, df = 1, p-value =
#> 0.02
#> alternative hypothesis: true common odds ratio is not equal to 1
#> 95 percent confidence interval:
#>  0.6006 0.9593
#> sample estimates:
#> common odds ratio 
#>             0.759
# 把 Allele 放到最后，结果不同了
ct.3 <- xtabs(Count ~ Location + Habitat + Allele, data = fish)
ct.4 <- xtabs(Count ~ Habitat + Location + Allele, data = fish)
mantelhaen.test(ct.3)
#> 
#>  Cochran-Mantel-Haenszel test
#> 
#> data:  ct.3
#> Cochran-Mantel-Haenszel M^2 = 168, df = 3,
#> p-value <2e-16
mantelhaen.test(ct.4)
#> 
#>  Cochran-Mantel-Haenszel test
#> 
#> data:  ct.4
#> Cochran-Mantel-Haenszel M^2 = 168, df = 3,
#> p-value <2e-16
# 把 Habitat 放最后，结果也不同
ct.5 <- xtabs(Count ~ Allele + Location + Habitat, data = fish)
ct.6 <- xtabs(Count ~ Location + Allele + Habitat, data = fish)
mantelhaen.test(ct.5)
#> 
#>  Cochran-Mantel-Haenszel test
#> 
#> data:  ct.5
#> Cochran-Mantel-Haenszel M^2 = 2, df = 3, p-value
#> = 0.6
mantelhaen.test(ct.6)
#> 
#>  Cochran-Mantel-Haenszel test
#> 
#> data:  ct.6
#> Cochran-Mantel-Haenszel M^2 = 2, df = 3, p-value
#> = 0.6

7.3.2.3 McNemar 检验

McNemar 检验概念上是频数数据的一个被试内检验。例如，假设你想要检验是否一个处理增加了一个人对某个问题反应「yes」的概率，而且你只有每个人处理前和处理后的数据。标准的卡方检验将不合适，因为它假设了组别是独立的。取而代之，我们可以使用 McNemar 检验。该检验仅适用于当存在一个独立变量的两次测量时。用于 McNemar 的列联表与用于卡方检验的非常相似，但结构上是不同的。

假设你有下面的数据。每个对象有处理前和后的反应。

data <- read.table(header = TRUE, text = "
 subject time result
     1  pre      0
     1 post      1
     2  pre      1
     2 post      1
     3  pre      0
     3 post      1
     4  pre      1
     4 post      0
     5  pre      1
     5 post      1
     6  pre      0
     6 post      1
     7  pre      0
     7 post      1
     8  pre      0
     8 post      1
     9  pre      0
     9 post      1
    10  pre      1
    10 post      1
    11  pre      0
    11 post      0
    12  pre      1
    12 post      1
    13  pre      0
    13 post      1
    14  pre      0
    14 post      0
    15  pre      0
    15 post      1
")

library(tidyr)
data_wide <- spread(data, time, result)
data_wide
#>    subject post pre
#> 1        1    1   0
#> 2        2    1   1
#> 3        3    1   0
#> 4        4    0   1
#> 5        5    1   1
#> 6        6    1   0
#> 7        7    1   0
#> 8        8    1   0
#> 9        9    1   0
#> 10      10    1   1
#> 11      11    0   0
#> 12      12    1   1
#> 13      13    1   0
#> 14      14    0   0
#> 15      15    1   0

接下来从数据框的 pre 和 post 列生成列联表：

ct <- table(data_wide[, c("pre", "post")])
ct
#>    post
#> pre 0 1
#>   0 2 8
#>   1 1 4
# 下面是用于标准卡方检验的列联表，注意差别
table(data[, c("time", "result")])
#>       result
#> time    0  1
#>   post  3 12
#>   pre  10  5

执行检验：

mcnemar.test(ct)
#> 
#>  McNemar's Chi-squared test with continuity
#>  correction
#> 
#> data:  ct
#> McNemar's chi-squared = 4, df = 1, p-value =
#> 0.05

对于小样本，它会使用连续校正。我们可以使用精确校正的 McNemar 检验替换这种校正方式，前者更加的精确，可通过 exact2x2 包获取。

library(exact2x2)
#> Loading required package: exactci
#> Loading required package: ssanv
mcnemar.exact(ct)
#> 
#>  Exact McNemar test (with central confidence
#>  intervals)
#> 
#> data:  ct
#> b = 8, c = 1, p-value = 0.04
#> alternative hypothesis: true odds ratio is not equal to 1
#> 95 percent confidence interval:
#>    1.073 354.981
#> sample estimates:
#> odds ratio 
#>          8

7.3 频率检验

7.3 频率检验

7.3.1 问题

7.3.2 方案

7.3.2.1 拟合优度检验 （期望频率）

7.3.2.1.1 卡方检验

7.3.2.1.2 精确二项检验

7.3.2.2 独立检验（比较组间）

7.3.2.2.1 卡方检验

7.3.2.2.2 Fisher 精确检验

7.3.2.2.3 Cochran-Mantel-Haenszel 检验

7.3.2.3 McNemar 检验

7.3.2.1 拟合优度检验（期望频率）