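6.18 Summarizing data
6.18.1 Problem
You want to summarize your data by group (mean, standard deviation, and so on).
6.18.2 Solution
Three ways are described here to group data by some specified variables and apply a summary function (such as mean or standard deviation) to each group.
- The ddply() function: easy to use, but requires loading the plyr package. This is probably the one you want.
- The summaryBy() function: also easy to use, but requires loading the doBy package.
- The aggregate() function: more difficult to use, but built into R.
Suppose you have the following data and want the sample size, mean of change, standard deviation, and standard error of the mean for each group, where the groups are the combinations of sex and condition: F-placebo, F-aspirin, M-placebo, and M-aspirin.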
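# stringsAsFactors=TRUE keeps sex and condition as factors; since R 4.0.0
# read.table() no longer converts strings to factors by default
data <- read.table(header=TRUE, stringsAsFactors=TRUE, text='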
subject sex condition before after change
1 F placebo 10.1 6.9 -3.2
2 F placebo 6.3 4.2 -2.1
3 M aspirin 12.4 6.3 -6.1
4 F placebo 8.1 6.1 -2.0
5 M aspirin 15.2 9.9 -5.3
6 F aspirin 10.9 7.0 -3.9
7 F aspirin 11.6 8.5 -3.1
8 M aspirin 9.5 3.0 -6.5
9 F placebo 11.5 9.0 -2.5
10 M placebo 11.9 11.0 -0.9
11 F aspirin 11.4 8.0 -3.4
12 M aspirin 10.0 4.4 -5.6
13 M aspirin 12.5 5.4 -7.1
14 M placebo 10.6 10.6 0.0
15 M aspirin 9.1 4.3 -4.8
16 F placebo 12.1 10.2 -1.9
17 F placebo 11.0 8.8 -2.2
18 F placebo 11.9 10.2 -1.7
19 M aspirin 9.1 3.6 -5.5
20 M placebo 13.5 12.4 -1.1
21 M aspirin 12.0 7.5 -4.5
22 F placebo 9.1 7.6 -1.5
23 M placebo 9.9 8.0 -1.9
24 F placebo 7.6 5.2 -2.4
25 F placebo 11.8 9.7 -2.1
26 F placebo 11.8 10.7 -1.1
27 F aspirin 10.1 7.9 -2.2
28 M aspirin 11.6 8.3 -3.3
29 F aspirin 11.3 6.8 -4.5
30 F placebo 10.3 8.3 -2.0
')
6.18.2.1 Using ddply
library(plyr)
# Run the functions length, mean, and sd on the value of "change" for each group,
# where the groups are defined by sex + condition
cdata <- ddply(data, c("sex", "condition"), summarise, N = length(change),
mean = mean(change), sd = sd(change), se = sd/sqrt(N))
cdata
#> sex condition N mean sd se
#> 1 F aspirin 5 -3.420 0.8643 0.3865
#> 2 F placebo 12 -2.058 0.5248 0.1515
#> 3 M aspirin 9 -5.411 1.1308 0.3769
#> 4 M placebo 4 -0.975 0.7805 0.3902
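6.18.2.1.1 Handling missing data
If the data contain NA values, pass na.rm=TRUE to each function so that missing values are dropped. Because length() has no na.rm option, use sum(!is.na(…)) to count the non-missing values.
# Put some NAs in the data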
dataNA <- data
dataNA$change[11:14] <- NA
cdata <- ddply(dataNA, c("sex", "condition"), summarise,
N = sum(!is.na(change)), mean = mean(change, na.rm = TRUE),
sd = sd(change, na.rm = TRUE), se = sd/sqrt(N))
cdata
#> sex condition N mean sd se
#> 1 F aspirin 4 -3.425 0.9979 0.4990
#> 2 F placebo 12 -2.058 0.5248 0.1515
#> 3 M aspirin 7 -5.143 1.0675 0.4035
#> 4 M placebo 3 -1.300 0.5292 0.3055
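To see why the count needs separate handling, the following minimal sketch compares length() with sum(!is.na(x)) on a small made-up vector. The vector x is an illustration only and is not part of the data set above.

```r
# Hypothetical vector with one missing value (illustration only)
x <- c(-3.4, NA, -2.2)

n_all   <- length(x)        # counts every element, including the NA
n_valid <- sum(!is.na(x))   # counts only the non-missing elements

mean_raw   <- mean(x)                # NA, because one element is missing
mean_clean <- mean(x, na.rm = TRUE)  # mean of the non-missing values only

print(c(n_all = n_all, n_valid = n_valid))
print(c(mean_raw = mean_raw, mean_clean = mean_clean))
```

Using n_valid rather than n_all as N keeps the standard error consistent with the values that actually entered the mean and standard deviation.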
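6.18.2.1.2 A function that automates the summary
Instead of manually specifying the values we want and then calculating the standard error, as above, this function handles all of those details. It does the following:
- Finds the mean, standard deviation, and count
- Finds the standard error of the mean (again, this may not be what you want if you are working with within-subjects variables)
- Finds the 95% confidence interval (a different level can be specified)
- Renames the columns of the resulting data frame so that it is easier to work with afterwards
To use it, put the function in your code and call it as shown below.
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and
## confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)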
summarySE <- function(data = NULL, measurevar, groupvars = NULL,
na.rm = FALSE, conf.interval = 0.95, .drop = TRUE) {
library(plyr)
# New version of length which can handle NA's: if
# na.rm==T, don't count them
length2 <- function(x, na.rm = FALSE) {
if (na.rm)
sum(!is.na(x)) else length(x)
}
# This does the summary. For each group's data frame,
# return a vector with N, mean, and sd
datac <- ddply(data, groupvars, .drop = .drop, .fun = function(xx,
col) {
c(N = length2(xx[[col]], na.rm = na.rm), mean = mean(xx[[col]],
na.rm = na.rm), sd = sd(xx[[col]], na.rm = na.rm))
}, measurevar)
# Rename the 'mean' column
datac <- rename(datac, c(mean = measurevar))
datac$se <- datac$sd/sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval: e.g.,
# if conf.interval is .95, use .975 (above/below), and
# use df=N-1
ciMult <- qt(conf.interval/2 + 0.5, datac$N - 1)
datac$ci <- datac$se * ciMult
return(datac)
}
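Example usage (with a 95% confidence interval). Instead of carrying out all the steps by hand as before, the summarySE() function does everything in one step: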
summarySE(data, measurevar = "change", groupvars = c("sex",
"condition"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420 0.8643 0.3865 1.0732
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 9 -5.411 1.1308 0.3769 0.8692
#> 4 M placebo 4 -0.975 0.7805 0.3902 1.2419
# With a data frame that contains NAs, use na.rm=TRUE
summarySE(dataNA, measurevar = "change", groupvars = c("sex",
"condition"), na.rm = TRUE)
#> sex condition N change sd se ci
#> 1 F aspirin 4 -3.425 0.9979 0.4990 1.5879
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 7 -5.143 1.0675 0.4035 0.9873
#> 4 M placebo 3 -1.300 0.5292 0.3055 1.3145
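As a check on what the se and ci columns mean, the calculation can be reproduced by hand for a single group. The sketch below uses the five change values of the F-aspirin group from the data above (subjects 6, 7, 11, 27, and 29); the variable names are chosen here for illustration.

```r
# change values for the F-aspirin group (subjects 6, 7, 11, 27, 29)
x <- c(-3.9, -3.1, -3.4, -2.2, -4.5)

n  <- length(x)
se <- sd(x) / sqrt(n)                  # standard error of the mean

# Two-sided 95% interval: use the 0.975 quantile of the t distribution
# with n - 1 degrees of freedom
conf_interval <- 0.95
ci_mult <- qt(conf_interval / 2 + 0.5, df = n - 1)
ci <- se * ci_mult                     # half-width of the confidence interval

round(c(N = n, mean = mean(x), sd = sd(x), se = se, ci = ci), 4)

# The interval itself is mean - ci to mean + ci
c(lower = mean(x) - ci, upper = mean(x) + ci)
```

The ci column is therefore a half-width: it is added to and subtracted from the group mean to obtain the interval bounds.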
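6.18.2.1.3 Filling empty combinations with zeros
Sometimes the summary data frame will have empty combinations of factors, that is, combinations that are possible but do not actually occur in the original data frame. It is often useful to fill in those combinations automatically in the summary data frame. To do this, set .drop=FALSE in the call to ddply() or summarySE(). The added rows have a count of zero, and the remaining statistics are NA or NaN.
Example usage:
# First remove all Male+Placebo entries from the data
dataSub <- subset(data, !(sex == "M" & condition == "placebo"))
# If we summarize the data, there will be a missing row for Male+Placebo,
# since there are no cases left with this combination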
summarySE(dataSub, measurevar = "change", groupvars = c("sex",
"condition"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420 0.8643 0.3865 1.0732
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 9 -5.411 1.1308 0.3769 0.8692
# Set .drop=FALSE so that the empty combination is not dropped
summarySE(dataSub, measurevar = "change", groupvars = c("sex",
"condition"), .drop = FALSE)
#> Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420 0.8643 0.3865 1.0732
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 9 -5.411 1.1308 0.3769 0.8692
#> 4 M placebo 0 NaN NA NA NaN
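The empty row can only be restored because a factor remembers its levels even after every observation of one level has been removed. The following minimal sketch shows this on a made-up factor (g and g_sub are illustrative names), and also shows where the NaN warning comes from.

```r
# Hypothetical factor with two levels (illustration only)
g <- factor(c("placebo", "aspirin", "aspirin"))

# Remove every "placebo" observation
g_sub <- g[g == "aspirin"]

levels(g_sub)            # both levels are still recorded
counts <- table(g_sub)   # the unused level appears with a count of 0
counts

# For an empty group N is 0, so the degrees of freedom N - 1 are negative
# and qt() returns NaN (with a warning, suppressed here)
ci_mult <- suppressWarnings(qt(0.975, df = 0 - 1))
ci_mult
```

If the grouping columns are character vectors rather than factors, there are no stored levels to expand, so the grouping variables need to be factors for this approach to work.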
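6.18.2.2 Using summaryBy
To collapse the data with the summaryBy() function:
library(doBy)
# Run the functions length, mean, and sd on the value of "change" for each group,
# where the groups are defined by sex + condition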
cdata <- summaryBy(change ~ sex + condition, data = data,
FUN = c(length, mean, sd))
cdata
#> sex condition change.length change.mean change.sd
#> 1 F aspirin 5 -3.420 0.8643
#> 2 F placebo 12 -2.058 0.5248
#> 3 M aspirin 9 -5.411 1.1308
#> 4 M placebo 4 -0.975 0.7805
# Rename the column change.length to just N
names(cdata)[names(cdata) == "change.length"] <- "N"
# Calculate the standard error of the mean
cdata$change.se <- cdata$change.sd/sqrt(cdata$N)
cdata
#> sex condition N change.mean change.sd change.se
#> 1 F aspirin 5 -3.420 0.8643 0.3865
#> 2 F placebo 12 -2.058 0.5248 0.1515
#> 3 M aspirin 9 -5.411 1.1308 0.3769
#> 4 M placebo 4 -0.975 0.7805 0.3902
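Note that if you have any within-subjects variables, these standard error values will not be useful for comparing groups.
6.18.2.2.1 Handling missing data
If the data contain NA values, you need to add the na.rm=TRUE option. Normally you could pass it through summaryBy(), but length() does not recognize this option. One way around this is to define a new length function, based on length(), that handles NA values.
# New version of length which can handle NAs: if na.rm=TRUE, don't count them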
length2 <- function(x, na.rm = FALSE) {
if (na.rm)
sum(!is.na(x)) else length(x)
}
# Put some NAs in the data
dataNA <- data
dataNA$change[11:14] <- NA
cdataNA <- summaryBy(change ~ sex + condition, data = dataNA,
FUN = c(length2, mean, sd), na.rm = TRUE)
cdataNA
#> sex condition change.length2 change.mean change.sd
#> 1 F aspirin 4 -3.425 0.9979
#> 2 F placebo 12 -2.058 0.5248
#> 3 M aspirin 7 -5.143 1.0675
#> 4 M placebo 3 -1.300 0.5292
# Now continue with the remaining steps, as before
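6.18.2.2.2 A function that automates the summary
Note that this automated summary function differs from the previous one: it is implemented with summaryBy().
Instead of manually specifying the values we want and then calculating the standard error, as above, this function handles all of those details. It does the following:
- Finds the mean, standard deviation, and count
- Finds the standard error of the mean
- Finds the 95% confidence interval (a different level can be specified)
- Renames the columns of the resulting data frame so that it is easier to work with afterwards
To use it, put the function in your code and call it as shown below.
## Summarizes data.
## Gives count, mean, standard deviation, standard error of the mean, and
## confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)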
summarySE <- function(data = NULL, measurevar, groupvars = NULL,
na.rm = FALSE, conf.interval = 0.95) {
library(doBy)
# New version of length which can handle NA's: if
# na.rm==T, don't count them
length2 <- function(x, na.rm = FALSE) {
if (na.rm)
sum(!is.na(x)) else length(x)
}
# Collapse the data
formula <- as.formula(paste(measurevar, paste(groupvars,
collapse = " + "), sep = " ~ "))
datac <- summaryBy(formula, data = data, FUN = c(length2,
mean, sd), na.rm = na.rm)
# Rename columns
names(datac)[names(datac) == paste(measurevar, ".mean",
sep = "")] <- measurevar
names(datac)[names(datac) == paste(measurevar, ".sd",
sep = "")] <- "sd"
names(datac)[names(datac) == paste(measurevar, ".length2",
sep = "")] <- "N"
datac$se <- datac$sd/sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval: e.g.,
# if conf.interval is .95, use .975 (above/below), and
# use df=N-1
ciMult <- qt(conf.interval/2 + 0.5, datac$N - 1)
datac$ci <- datac$se * ciMult
return(datac)
}
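The formula-building step inside this version of summarySE() can be tried on its own. The sketch below repeats that step with the column names used in this section, to show how the character arguments are pasted into a formula.

```r
measurevar <- "change"
groupvars  <- c("sex", "condition")

# Join the grouping variables with " + ", then put the measure variable
# on the left-hand side of "~"
rhs <- paste(groupvars, collapse = " + ")
formula_text <- paste(measurevar, rhs, sep = " ~ ")
formula_text

# Convert the text into a formula object
f <- as.formula(formula_text)
class(f)
all.vars(f)   # the variable names the formula refers to
```

Building the formula from strings is what lets the function accept column names as ordinary character arguments.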
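Example usage (with a 95% confidence interval). Instead of carrying out all the steps by hand as before, the summarySE() function does everything in one step: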
summarySE(data, measurevar = "change", groupvars = c("sex",
"condition"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420 0.8643 0.3865 1.0732
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 9 -5.411 1.1308 0.3769 0.8692
#> 4 M placebo 4 -0.975 0.7805 0.3902 1.2419
# With a data frame that contains NAs, use na.rm=TRUE
summarySE(dataNA, measurevar = "change", groupvars = c("sex",
"condition"), na.rm = TRUE)
#> sex condition N change sd se ci
#> 1 F aspirin 4 -3.425 0.9979 0.4990 1.5879
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 7 -5.143 1.0675 0.4035 0.9873
#> 4 M placebo 3 -1.300 0.5292 0.3055 1.3145
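6.18.2.2.3 Filling empty combinations with zeros
Sometimes the summary data frame will have empty combinations of factors, that is, combinations that are possible but do not actually occur in the original data frame. It is often useful to fill in those combinations automatically in the summary data frame.
This example fills in the missing combinations with zeros:
fillMissingCombs <- function(df, factors, measures) {
# Make a list of the combinations of factor levels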
levelList <- list()
for (f in factors) {
levelList[[f]] <- levels(df[, f])
}
fullFactors <- expand.grid(levelList)
dfFull <- merge(fullFactors, df, all.x = TRUE)
# Replace the NAs in the measure variables with 0
for (m in measures) {
dfFull[is.na(dfFull[, m]), m] <- 0
}
return(dfFull)
}
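Example usage:
# First remove all Male+Placebo entries from the data
dataSub <- subset(data, !(sex == "M" & condition == "placebo"))
# If we summarize the data, there will be a missing row for Male+Placebo,
# since there are no cases left with this combination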
cdataSub <- summarySE(dataSub, measurevar = "change", groupvars = c("sex",
"condition"))
cdataSub
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420 0.8643 0.3865 1.0732
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 9 -5.411 1.1308 0.3769 0.8692
# Fill in the missing combinations with zeros
fillMissingCombs(cdataSub, factors = c("sex", "condition"),
measures = c("N", "change", "sd", "se", "ci"))
#> sex condition N change sd se ci
#> 1 F aspirin 5 -3.420 0.8643 0.3865 1.0732
#> 2 F placebo 12 -2.058 0.5248 0.1515 0.3334
#> 3 M aspirin 9 -5.411 1.1308 0.3769 0.8692
#> 4 M placebo 0 0.000 0.0000 0.0000 0.0000
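The core of fillMissingCombs() is a combination of expand.grid() and merge(). The following minimal sketch shows that idea on a small hand-made summary table; the data frame summ and its values are made up for illustration and are not produced by the code above.

```r
# Hypothetical summary table in which the M + placebo combination is absent
summ <- data.frame(
  sex       = factor(c("F", "F", "M"), levels = c("F", "M")),
  condition = factor(c("aspirin", "placebo", "aspirin"),
                     levels = c("aspirin", "placebo")),
  N         = c(5, 12, 9)
)

# All possible combinations of the factor levels
full <- expand.grid(sex = levels(summ$sex),
                    condition = levels(summ$condition))

# Left join: keep every combination, with NA where no summary row exists
merged <- merge(full, summ, all.x = TRUE)

# Replace the NA in the measure column with 0
merged$N[is.na(merged$N)] <- 0
merged
```

Whether zero is a sensible fill value depends on the measure: a count of 0 is meaningful, whereas a mean or standard deviation of 0 for an empty group can be misleading and may be better left as NA.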
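6.18.2.3 Using aggregate()
The aggregate() function is more difficult to use, but it is built into R, so there is no need to install other packages.
# Get a count of the number of subjects in each category (sex*condition)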
cdata <- aggregate(data["subject"], by = data[c("sex", "condition")],
FUN = length)
cdata
#> sex condition subject
#> 1 F aspirin 5
#> 2 M aspirin 9
#> 3 F placebo 12
#> 4 M placebo 4
# Rename the 'subject' column to 'N'
names(cdata)[names(cdata) == "subject"] <- "N"
cdata
#> sex condition N
#> 1 F aspirin 5
#> 2 M aspirin 9
#> 3 F placebo 12
#> 4 M placebo 4
# Sort by sex
cdata <- cdata[order(cdata$sex), ]
cdata
#> sex condition N
#> 1 F aspirin 5
#> 3 F placebo 12
#> 2 M aspirin 9
#> 4 M placebo 4
# We also keep the before and after columns:
# get the average effect size by sex and condition
cdata.means <- aggregate(data[c("before", "after", "change")],
by = data[c("sex", "condition")], FUN = mean)
cdata.means
#> sex condition before after change
#> 1 F aspirin 11.06 7.640 -3.420
#> 2 M aspirin 11.27 5.856 -5.411
#> 3 F placebo 10.13 8.075 -2.058
#> 4 M placebo 11.47 10.500 -0.975
# Merge the data frames
cdata <- merge(cdata, cdata.means)
cdata
#> sex condition N before after change
#> 1 F aspirin 5 11.06 7.640 -3.420
#> 2 F placebo 12 10.13 8.075 -2.058
#> 3 M aspirin 9 11.27 5.856 -5.411
#> 4 M placebo 4 11.47 10.500 -0.975
# Get the standard deviation
cdata.sd <- aggregate(data["change"], by = data[c("sex",
"condition")], FUN = sd)
# Rename the column
names(cdata.sd)[names(cdata.sd) == "change"] <- "change.sd"
cdata.sd
#> sex condition change.sd
#> 1 F aspirin 0.8643
#> 2 M aspirin 1.1308
#> 3 F placebo 0.5248
#> 4 M placebo 0.7805
# Merge
cdata <- merge(cdata, cdata.sd)
cdata
#> sex condition N before after change change.sd
#> 1 F aspirin 5 11.06 7.640 -3.420 0.8643
#> 2 F placebo 12 10.13 8.075 -2.058 0.5248
#> 3 M aspirin 9 11.27 5.856 -5.411 1.1308
#> 4 M placebo 4 11.47 10.500 -0.975 0.7805
# Calculate the standard error
cdata$change.se <- cdata$change.sd/sqrt(cdata$N)
cdata
#> sex condition N before after change change.sd
#> 1 F aspirin 5 11.06 7.640 -3.420 0.8643
#> 2 F placebo 12 10.13 8.075 -2.058 0.5248
#> 3 M aspirin 9 11.27 5.856 -5.411 1.1308
#> 4 M placebo 4 11.47 10.500 -0.975 0.7805
#> change.se
#> 1 0.3865
#> 2 0.1515
#> 3 0.3769
#> 4 0.3902
If you have NA values that you want to skip, set na.rm=TRUE:
cdata.means <- aggregate(data[c("before", "after", "change")],
by = data[c("sex", "condition")], FUN = mean, na.rm = TRUE)
cdata.means
#> sex condition before after change
#> 1 F aspirin 11.06 7.640 -3.420
#> 2 M aspirin 11.27 5.856 -5.411
#> 3 F placebo 10.13 8.075 -2.058
#> 4 M placebo 11.47 10.500 -0.975
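As an alternative to the repeated aggregate-and-merge steps above, aggregate() also has a formula interface, and the function passed as FUN may return several values at once. The sketch below is one possible way to use it and is not the approach taken in this section; the data frame df is a small made-up example. Note that the formula interface drops rows with NA by default (na.action = na.omit).

```r
# Hypothetical data (illustration only)
df <- data.frame(
  sex       = c("F", "F", "F", "M", "M", "M"),
  condition = c("aspirin", "aspirin", "placebo", "placebo", "aspirin", "aspirin"),
  change    = c(-3.9, -3.1, -2.0, -0.9, -6.1, -5.3)
)

# FUN returns a named vector, so the result holds a matrix column "change"
res <- aggregate(change ~ sex + condition, data = df,
                 FUN = function(x) c(N = length(x), mean = mean(x)))

# Flatten the matrix column into ordinary columns change.N and change.mean
res <- do.call(data.frame, res)
res
```

This gives the count and the mean in one call, at the cost of the extra flattening step.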