Summarizing data

Problem

You want to do summarize your data (with mean, standard deviation, etc.), broken down by group.

Solution

There are three ways described here to group data based on some specified variables, and apply a summary function (like mean, standard deviation, etc.) to each group.

  • The ddply() function. It is the easiest to use, though it requires the plyr package. This is probably what you want to use.
  • The summarizeBy() function. It is easier to use, though it requires the doBy package.
  • The aggregate() function. It is more difficult to use but is included in the base install of R.

Suppose you have this data and want to find the N, mean of change, standard deviation, and standard error of the mean for each group, where the groups are specified by each combination of sex and condition: F-placebo, F-aspirin, M-placebo, and M-aspirin.

  1. data <- read.table(header=TRUE, text='
  2. subject sex condition before after change
  3. 1 F placebo 10.1 6.9 -3.2
  4. 2 F placebo 6.3 4.2 -2.1
  5. 3 M aspirin 12.4 6.3 -6.1
  6. 4 F placebo 8.1 6.1 -2.0
  7. 5 M aspirin 15.2 9.9 -5.3
  8. 6 F aspirin 10.9 7.0 -3.9
  9. 7 F aspirin 11.6 8.5 -3.1
  10. 8 M aspirin 9.5 3.0 -6.5
  11. 9 F placebo 11.5 9.0 -2.5
  12. 10 M placebo 11.9 11.0 -0.9
  13. 11 F aspirin 11.4 8.0 -3.4
  14. 12 M aspirin 10.0 4.4 -5.6
  15. 13 M aspirin 12.5 5.4 -7.1
  16. 14 M placebo 10.6 10.6 0.0
  17. 15 M aspirin 9.1 4.3 -4.8
  18. 16 F placebo 12.1 10.2 -1.9
  19. 17 F placebo 11.0 8.8 -2.2
  20. 18 F placebo 11.9 10.2 -1.7
  21. 19 M aspirin 9.1 3.6 -5.5
  22. 20 M placebo 13.5 12.4 -1.1
  23. 21 M aspirin 12.0 7.5 -4.5
  24. 22 F placebo 9.1 7.6 -1.5
  25. 23 M placebo 9.9 8.0 -1.9
  26. 24 F placebo 7.6 5.2 -2.4
  27. 25 F placebo 11.8 9.7 -2.1
  28. 26 F placebo 11.8 10.7 -1.1
  29. 27 F aspirin 10.1 7.9 -2.2
  30. 28 M aspirin 11.6 8.3 -3.3
  31. 29 F aspirin 11.3 6.8 -4.5
  32. 30 F placebo 10.3 8.3 -2.0
  33. ')

Using ddply

  1. library(plyr)
  2. # Run the functions length, mean, and sd on the value of "change" for each group,
  3. # broken down by sex + condition
  4. cdata <- ddply(data, c("sex", "condition"), summarise,
  5. N = length(change),
  6. mean = mean(change),
  7. sd = sd(change),
  8. se = sd / sqrt(N)
  9. )
  10. cdata
  11. #> sex condition N mean sd se
  12. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230
  13. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867
  14. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190
  15. #> 4 M placebo 4 -0.975000 0.7804913 0.3902456

Handling missing data

If there are NA’s in the data, you need to pass the flag na.rm=TRUE to each of the functions. length() doesn’t take na.rm as an option, so one way to work around it is to use sum(!is.na(…)) to count how many non-NA’s there are.

  1. # Put some NA's in the data
  2. dataNA <- data
  3. dataNA$change[11:14] <- NA
  4. cdata <- ddply(dataNA, c("sex", "condition"), summarise,
  5. N = sum(!is.na(change)),
  6. mean = mean(change, na.rm=TRUE),
  7. sd = sd(change, na.rm=TRUE),
  8. se = sd / sqrt(N)
  9. )
  10. cdata
  11. #> sex condition N mean sd se
  12. #> 1 F aspirin 4 -3.425000 0.9979145 0.4989572
  13. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867
  14. #> 3 M aspirin 7 -5.142857 1.0674848 0.4034713
  15. #> 4 M placebo 3 -1.300000 0.5291503 0.3055050

A function for mean, count, standard deviation, standard error of the mean, and confidence interval

Instead of manually specifying all the values you want and then calculating the standard error, as shown above, this function will handle all of those details. It will do all the things described here:

  • Find the mean, standard deviation, and count (N)
  • Find the standard error of the mean (again, this may not be what you want if you are collapsing over a within-subject variable. See ../../Graphs/Plotting means and error bars (ggplot2) for information on how to make error bars for graphs with within-subjects variables.)
  • Find a 95% confidence interval (or other value, if desired)
  • Rename the columns so that the resulting data frame is easier to work with

To use, put this function in your code and call it as demonstrated below.

  1. ## Summarizes data.
  2. ## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
  3. ## data: a data frame.
  4. ## measurevar: the name of a column that contains the variable to be summariezed
  5. ## groupvars: a vector containing names of columns that contain grouping variables
  6. ## na.rm: a boolean that indicates whether to ignore NA's
  7. ## conf.interval: the percent range of the confidence interval (default is 95%)
  8. summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
  9. conf.interval=.95, .drop=TRUE) {
  10. library(plyr)
  11. # New version of length which can handle NA's: if na.rm==T, don't count them
  12. length2 <- function (x, na.rm=FALSE) {
  13. if (na.rm) sum(!is.na(x))
  14. else length(x)
  15. }
  16. # This does the summary. For each group's data frame, return a vector with
  17. # N, mean, and sd
  18. datac <- ddply(data, groupvars, .drop=.drop,
  19. .fun = function(xx, col) {
  20. c(N = length2(xx[[col]], na.rm=na.rm),
  21. mean = mean (xx[[col]], na.rm=na.rm),
  22. sd = sd (xx[[col]], na.rm=na.rm)
  23. )
  24. },
  25. measurevar
  26. )
  27. # Rename the "mean" column
  28. datac <- rename(datac, c("mean" = measurevar))
  29. datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
  30. # Confidence interval multiplier for standard error
  31. # Calculate t-statistic for confidence interval:
  32. # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
  33. ciMult <- qt(conf.interval/2 + .5, datac$N-1)
  34. datac$ci <- datac$se * ciMult
  35. return(datac)
  36. }

Example usage (with 95% confidence interval). Instead of doing all the steps manually, as done previously, the summarySE function does it all in one step:

  1. summarySE(data, measurevar="change", groupvars=c("sex", "condition"))
  2. #> sex condition N change sd se ci
  3. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
  4. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  5. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
  6. #> 4 M placebo 4 -0.975000 0.7804913 0.3902456 1.2419358
  7. # With a data set with NA's, use na.rm=TRUE
  8. summarySE(dataNA, measurevar="change", groupvars=c("sex", "condition"), na.rm=TRUE)
  9. #> sex condition N change sd se ci
  10. #> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 1.5879046
  11. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  12. #> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 0.9872588
  13. #> 4 M placebo 3 -1.300000 0.5291503 0.3055050 1.3144821

Filling empty combinations with zeros

Sometimes there will be empty combinations of factors in the summary data frame – that is, combinations of factors that are possible, but don’t actually occur in the original data frame. It is often useful to automatically fill in those combinations in the summary data frame with NA’s. To do this, set .drop=FALSE in the call to ddply or summarySE.

Example usage:

  1. # First remove some all Male+Placebo entries from the data
  2. dataSub <- subset(data, !(sex=="M" & condition=="placebo"))
  3. # If we summarize the data, there will be a missing row for Male+Placebo,
  4. # since there were no cases with this combination.
  5. summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
  6. #> sex condition N change sd se ci
  7. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
  8. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  9. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
  10. # Set .drop=FALSE to NOT drop those combinations
  11. summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"), .drop=FALSE)
  12. #> Warning in qt(conf.interval/2 + 0.5, datac$N - 1): NaNs produced
  13. #> sex condition N change sd se ci
  14. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
  15. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  16. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
  17. #> 4 M placebo 0 NaN NA NA NA

Using summaryBy

To collapse the data using the summarizeBy() function:

  1. library(doBy)
  2. # Run the functions length, mean, and sd on the value of "change" for each group,
  3. # broken down by sex + condition
  4. cdata <- summaryBy(change ~ sex + condition, data=data, FUN=c(length,mean,sd))
  5. cdata
  6. #> sex condition change.length change.mean change.sd
  7. #> 1 F aspirin 5 -3.420000 0.8642916
  8. #> 2 F placebo 12 -2.058333 0.5247655
  9. #> 3 M aspirin 9 -5.411111 1.1307569
  10. #> 4 M placebo 4 -0.975000 0.7804913
  11. # Rename column change.length to just N
  12. names(cdata)[names(cdata)=="change.length"] <- "N"
  13. # Calculate standard error of the mean
  14. cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
  15. cdata
  16. #> sex condition N change.mean change.sd change.se
  17. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230
  18. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867
  19. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190
  20. #> 4 M placebo 4 -0.975000 0.7804913 0.3902456

Note that if you have any within-subjects variables, these standard error values may not be useful for comparing groups. See ../../Graphs/Plotting means and error bars (ggplot2) for information on how to make error bars for graphs with within-subjects variables.

Handling missing data

If there are NA’s in the data, you need to pass the flag na.rm=TRUE to the functions. Normally you could pass it to summaryBy() and it would get passed to each of the functions called, but length() does not recognize it and so it won’t work. One way around it is to define a new length function that handles the NA’s.

  1. # New version of length which can handle NA's: if na.rm==T, don't count them
  2. length2 <- function (x, na.rm=FALSE) {
  3. if (na.rm) sum(!is.na(x))
  4. else length(x)
  5. }
  6. # Put some NA's in the data
  7. dataNA <- data
  8. dataNA$change[11:14] <- NA
  9. cdataNA <- summaryBy(change ~ sex + condition, data=dataNA,
  10. FUN=c(length2, mean, sd), na.rm=TRUE)
  11. cdataNA
  12. #> sex condition change.length2 change.mean change.sd
  13. #> 1 F aspirin 4 -3.425000 0.9979145
  14. #> 2 F placebo 12 -2.058333 0.5247655
  15. #> 3 M aspirin 7 -5.142857 1.0674848
  16. #> 4 M placebo 3 -1.300000 0.5291503
  17. # Now, do the same as before

A function for mean, count, standard deviation, standard error of the mean, and confidence interval

Instead of manually specifying all the values you want and then calculating the standard error, as shown above, this function will handle all of those details. It will do all the things described here:

  • Find the mean, standard deviation, and count (N)
  • Find the standard error of the mean (again, this may not be what you want if you are collapsing over a within-subject variable. See ../../Graphs/Plotting means and error bars (ggplot2) for information on how to make error bars for graphs with within-subjects variables.)
  • Find a 95% confidence interval (or other value, if desired)
  • Rename the columns so that the resulting data frame is easier to work with

To use, put this function in your code and call it as demonstrated below.

  1. ## Summarizes data.
  2. ## Gives count, mean, standard deviation, standard error of the mean, and confidence
  3. ## interval (default 95%).
  4. ## data: a data frame.
  5. ## measurevar: the name of a column that contains the variable to be summariezed
  6. ## groupvars: a vector containing names of columns that contain grouping variables
  7. ## na.rm: a boolean that indicates whether to ignore NA's
  8. ## conf.interval: the percent range of the confidence interval (default is 95%)
  9. summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE, conf.interval=.95) {
  10. library(doBy)
  11. # New version of length which can handle NA's: if na.rm==T, don't count them
  12. length2 <- function (x, na.rm=FALSE) {
  13. if (na.rm) sum(!is.na(x))
  14. else length(x)
  15. }
  16. # Collapse the data
  17. formula <- as.formula(paste(measurevar, paste(groupvars, collapse=" + "), sep=" ~ "))
  18. datac <- summaryBy(formula, data=data, FUN=c(length2,mean,sd), na.rm=na.rm)
  19. # Rename columns
  20. names(datac)[ names(datac) == paste(measurevar, ".mean", sep="") ] <- measurevar
  21. names(datac)[ names(datac) == paste(measurevar, ".sd", sep="") ] <- "sd"
  22. names(datac)[ names(datac) == paste(measurevar, ".length2", sep="") ] <- "N"
  23. datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
  24. # Confidence interval multiplier for standard error
  25. # Calculate t-statistic for confidence interval:
  26. # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
  27. ciMult <- qt(conf.interval/2 + .5, datac$N-1)
  28. datac$ci <- datac$se * ciMult
  29. return(datac)
  30. }

Example usage (with 95% confidence interval). Instead of doing all the steps manually, as done previously, the summarySE function does it all in one step:

  1. summarySE(data, measurevar="change", groupvars=c("sex","condition"))
  2. #> sex condition N change sd se ci
  3. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
  4. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  5. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
  6. #> 4 M placebo 4 -0.975000 0.7804913 0.3902456 1.2419358
  7. # With a data set with NA's, use na.rm=TRUE
  8. summarySE(dataNA, measurevar="change", groupvars=c("sex","condition"), na.rm=TRUE)
  9. #> sex condition N change sd se ci
  10. #> 1 F aspirin 4 -3.425000 0.9979145 0.4989572 1.5879046
  11. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  12. #> 3 M aspirin 7 -5.142857 1.0674848 0.4034713 0.9872588
  13. #> 4 M placebo 3 -1.300000 0.5291503 0.3055050 1.3144821

Filling empty combinations with zeros

Sometimes there will be empty combinations of factors in the summary data frame – that is, combinations of factors that are possible, but don’t actually occur in the original data frame. It is often useful to automatically fill in those combinations in the summary data frame with zeros.

This function will fill in those missing combinations with zeros:

  1. fillMissingCombs <- function(df, factors, measures) {
  2. # Make a list of the combinations of factor levels
  3. levelList <- list()
  4. for (f in factors) { levelList[[f]] <- levels(df[,f]) }
  5. fullFactors <- expand.grid(levelList)
  6. dfFull <- merge(fullFactors, df, all.x=TRUE)
  7. # Wherever there is an NA in the measure vars, replace with 0
  8. for (m in measures) {
  9. dfFull[is.na(dfFull[,m]), m] <- 0
  10. }
  11. return(dfFull)
  12. }

Example usage:

  1. # First remove some all Male+Placebo entries from the data
  2. dataSub <- subset(data, !(sex=="M" & condition=="placebo"))
  3. # If we summarize the data, there will be a missing row for Male+Placebo,
  4. # since there were no cases with this combination.
  5. cdataSub <- summarySE(dataSub, measurevar="change", groupvars=c("sex", "condition"))
  6. cdataSub
  7. #> sex condition N change sd se ci
  8. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
  9. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  10. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
  11. # This will fill in the missing combinations with zeros
  12. fillMissingCombs(cdataSub, factors=c("sex","condition"), measures=c("N","change","sd","se","ci"))
  13. #> sex condition N change sd se ci
  14. #> 1 F aspirin 5 -3.420000 0.8642916 0.3865230 1.0731598
  15. #> 2 F placebo 12 -2.058333 0.5247655 0.1514867 0.3334201
  16. #> 3 M aspirin 9 -5.411111 1.1307569 0.3769190 0.8691767
  17. #> 4 M placebo 0 0.000000 0.0000000 0.0000000 0.0000000

Using aggregate

The aggregate function is more difficult to use, but it is included in the base R installation and does not require the installation of another package.

  1. # Get a count of number of subjects in each category (sex*condition)
  2. cdata <- aggregate(data["subject"], by=data[c("sex","condition")], FUN=length)
  3. cdata
  4. #> sex condition subject
  5. #> 1 F aspirin 5
  6. #> 2 M aspirin 9
  7. #> 3 F placebo 12
  8. #> 4 M placebo 4
  9. # Rename "subject" column to "N"
  10. names(cdata)[names(cdata)=="subject"] <- "N"
  11. cdata
  12. #> sex condition N
  13. #> 1 F aspirin 5
  14. #> 2 M aspirin 9
  15. #> 3 F placebo 12
  16. #> 4 M placebo 4
  17. # Sort by sex first
  18. cdata <- cdata[order(cdata$sex),]
  19. cdata
  20. #> sex condition N
  21. #> 1 F aspirin 5
  22. #> 3 F placebo 12
  23. #> 2 M aspirin 9
  24. #> 4 M placebo 4
  25. # We also keep the __before__ and __after__ columns:
  26. # Get the average effect size by sex and condition
  27. cdata.means <- aggregate(data[c("before","after","change")],
  28. by = data[c("sex","condition")], FUN=mean)
  29. cdata.means
  30. #> sex condition before after change
  31. #> 1 F aspirin 11.06000 7.640000 -3.420000
  32. #> 2 M aspirin 11.26667 5.855556 -5.411111
  33. #> 3 F placebo 10.13333 8.075000 -2.058333
  34. #> 4 M placebo 11.47500 10.500000 -0.975000
  35. # Merge the data frames
  36. cdata <- merge(cdata, cdata.means)
  37. cdata
  38. #> sex condition N before after change
  39. #> 1 F aspirin 5 11.06000 7.640000 -3.420000
  40. #> 2 F placebo 12 10.13333 8.075000 -2.058333
  41. #> 3 M aspirin 9 11.26667 5.855556 -5.411111
  42. #> 4 M placebo 4 11.47500 10.500000 -0.975000
  43. # Get the sample (n-1) standard deviation for "change"
  44. cdata.sd <- aggregate(data["change"],
  45. by = data[c("sex","condition")], FUN=sd)
  46. # Rename the column to change.sd
  47. names(cdata.sd)[names(cdata.sd)=="change"] <- "change.sd"
  48. cdata.sd
  49. #> sex condition change.sd
  50. #> 1 F aspirin 0.8642916
  51. #> 2 M aspirin 1.1307569
  52. #> 3 F placebo 0.5247655
  53. #> 4 M placebo 0.7804913
  54. # Merge
  55. cdata <- merge(cdata, cdata.sd)
  56. cdata
  57. #> sex condition N before after change change.sd
  58. #> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916
  59. #> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655
  60. #> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569
  61. #> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913
  62. # Calculate standard error of the mean
  63. cdata$change.se <- cdata$change.sd / sqrt(cdata$N)
  64. cdata
  65. #> sex condition N before after change change.sd change.se
  66. #> 1 F aspirin 5 11.06000 7.640000 -3.420000 0.8642916 0.3865230
  67. #> 2 F placebo 12 10.13333 8.075000 -2.058333 0.5247655 0.1514867
  68. #> 3 M aspirin 9 11.26667 5.855556 -5.411111 1.1307569 0.3769190
  69. #> 4 M placebo 4 11.47500 10.500000 -0.975000 0.7804913 0.3902456

If you have NA’s in your data and wish to skip them, use na.rm=TRUE:

  1. cdata.means <- aggregate(data[c("before","after","change")],
  2. by = data[c("sex","condition")],
  3. FUN=mean, na.rm=TRUE)
  4. cdata.means
  5. #> sex condition before after change
  6. #> 1 F aspirin 11.06000 7.640000 -3.420000
  7. #> 2 M aspirin 11.26667 5.855556 -5.411111
  8. #> 3 F placebo 10.13333 8.075000 -2.058333
  9. #> 4 M placebo 11.47500 10.500000 -0.975000