Converting data between wide and long format

Problem

You want to do convert data from a wide format to a long format.

Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.

Solution

There are two sets of methods that are explained below:

  • gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.
  • melt() and dcast() from the reshape2 package.

There are a number of other methods which aren’t covered here, since they are not as easy to use:

  • The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.
  • stack() and unstack()

Sample data

These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.

  1. olddata_wide <- read.table(header=TRUE, text='
  2. subject sex control cond1 cond2
  3. 1 M 7.9 12.3 10.7
  4. 2 F 6.3 10.6 11.1
  5. 3 F 9.5 13.1 13.8
  6. 4 M 11.5 13.4 12.9
  7. ')
  8. # Make sure the subject column is a factor
  9. olddata_wide$subject <- factor(olddata_wide$subject)
  1. olddata_long <- read.table(header=TRUE, text='
  2. subject sex condition measurement
  3. 1 M control 7.9
  4. 1 M cond1 12.3
  5. 1 M cond2 10.7
  6. 2 F control 6.3
  7. 2 F cond1 10.6
  8. 2 F cond2 11.1
  9. 3 F control 9.5
  10. 3 F cond1 13.1
  11. 3 F cond2 13.8
  12. 4 M control 11.5
  13. 4 M cond1 13.4
  14. 4 M cond2 12.9
  15. ')
  16. # Make sure the subject column is a factor
  17. olddata_long$subject <- factor(olddata_long$subject)

tidyr

From wide to long

Use gather:

  1. olddata_wide
  2. #> subject sex control cond1 cond2
  3. #> 1 1 M 7.9 12.3 10.7
  4. #> 2 2 F 6.3 10.6 11.1
  5. #> 3 3 F 9.5 13.1 13.8
  6. #> 4 4 M 11.5 13.4 12.9
  7. library(tidyr)
  8. # The arguments to gather():
  9. # - data: Data object
  10. # - key: Name of new key column (made from names of data columns)
  11. # - value: Name of new value column
  12. # - ...: Names of source columns that contain values
  13. # - factor_key: Treat the new key column as a factor (instead of character vector)
  14. data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE)
  15. data_long
  16. #> subject sex condition measurement
  17. #> 1 1 M control 7.9
  18. #> 2 2 F control 6.3
  19. #> 3 3 F control 9.5
  20. #> 4 4 M control 11.5
  21. #> 5 1 M cond1 12.3
  22. #> 6 2 F cond1 10.6
  23. #> 7 3 F cond1 13.1
  24. #> 8 4 M cond1 13.4
  25. #> 9 1 M cond2 10.7
  26. #> 10 2 F cond2 11.1
  27. #> 11 3 F cond2 13.8
  28. #> 12 4 M cond2 12.9

In this example, the source columns that are gathered are specified with control:cond2. This means to use all the columns, positionally, between control and cond2. Another way of doing it is to name the columns individually, as in:

  1. gather(olddata_wide, condition, measurement, control, cond1, cond2)

If you need to use gather() programmatically, you may need to use variables containing column names. To do this, you should use the gather_() function instead, which takes strings instead of bare (unquoted) column names.

  1. keycol <- "condition"
  2. valuecol <- "measurement"
  3. gathercols <- c("control", "cond1", "cond2")
  4. gather_(olddata_wide, keycol, valuecol, gathercols)

Optional: Rename the factor levels of the variable column, and sort.

  1. # Rename factor names from "cond1" and "cond2" to "first" and "second"
  2. levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
  3. levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
  4. # Sort by subject first, then by condition
  5. data_long <- data_long[order(data_long$subject, data_long$condition), ]
  6. data_long
  7. #> subject sex condition measurement
  8. #> 1 1 M control 7.9
  9. #> 5 1 M first 12.3
  10. #> 9 1 M second 10.7
  11. #> 2 2 F control 6.3
  12. #> 6 2 F first 10.6
  13. #> 10 2 F second 11.1
  14. #> 3 3 F control 9.5
  15. #> 7 3 F first 13.1
  16. #> 11 3 F second 13.8
  17. #> 4 4 M control 11.5
  18. #> 8 4 M first 13.4
  19. #> 12 4 M second 12.9

From long to wide

Use spread:

  1. olddata_long
  2. #> subject sex condition measurement
  3. #> 1 1 M control 7.9
  4. #> 2 1 M cond1 12.3
  5. #> 3 1 M cond2 10.7
  6. #> 4 2 F control 6.3
  7. #> 5 2 F cond1 10.6
  8. #> 6 2 F cond2 11.1
  9. #> 7 3 F control 9.5
  10. #> 8 3 F cond1 13.1
  11. #> 9 3 F cond2 13.8
  12. #> 10 4 M control 11.5
  13. #> 11 4 M cond1 13.4
  14. #> 12 4 M cond2 12.9
  15. library(tidyr)
  16. # The arguments to spread():
  17. # - data: Data object
  18. # - key: Name of column containing the new column names
  19. # - value: Name of column containing values
  20. data_wide <- spread(olddata_long, condition, measurement)
  21. data_wide
  22. #> subject sex cond1 cond2 control
  23. #> 1 1 M 12.3 10.7 7.9
  24. #> 2 2 F 10.6 11.1 6.3
  25. #> 3 3 F 13.1 13.8 9.5
  26. #> 4 4 M 13.4 12.9 11.5

Optional: A few things to make the data look nicer.

  1. # Rename cond1 to first, and cond2 to second
  2. names(data_wide)[names(data_wide)=="cond1"] <- "first"
  3. names(data_wide)[names(data_wide)=="cond2"] <- "second"
  4. # Reorder the columns
  5. data_wide <- data_wide[, c(1,2,5,3,4)]
  6. data_wide
  7. #> subject sex control first second
  8. #> 1 1 M 7.9 12.3 10.7
  9. #> 2 2 F 6.3 10.6 11.1
  10. #> 3 3 F 9.5 13.1 13.8
  11. #> 4 4 M 11.5 13.4 12.9

The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.

reshape2

From wide to long

Use melt:

  1. olddata_wide
  2. #> subject sex control cond1 cond2
  3. #> 1 1 M 7.9 12.3 10.7
  4. #> 2 2 F 6.3 10.6 11.1
  5. #> 3 3 F 9.5 13.1 13.8
  6. #> 4 4 M 11.5 13.4 12.9
  7. library(reshape2)
  8. # Specify id.vars: the variables to keep but not split apart on
  9. melt(olddata_wide, id.vars=c("subject", "sex"))
  10. #> subject sex variable value
  11. #> 1 1 M control 7.9
  12. #> 2 2 F control 6.3
  13. #> 3 3 F control 9.5
  14. #> 4 4 M control 11.5
  15. #> 5 1 M cond1 12.3
  16. #> 6 2 F cond1 10.6
  17. #> 7 3 F cond1 13.1
  18. #> 8 4 M cond1 13.4
  19. #> 9 1 M cond2 10.7
  20. #> 10 2 F cond2 11.1
  21. #> 11 3 F cond2 13.8
  22. #> 12 4 M cond2 12.9

There are options for melt that can make the output a little easier to work with:

  1. data_long <- melt(olddata_wide,
  2. # ID variables - all the variables to keep but not split apart on
  3. id.vars=c("subject", "sex"),
  4. # The source columns
  5. measure.vars=c("control", "cond1", "cond2" ),
  6. # Name of the destination column that will identify the original
  7. # column that the measurement came from
  8. variable.name="condition",
  9. value.name="measurement"
  10. )
  11. data_long
  12. #> subject sex condition measurement
  13. #> 1 1 M control 7.9
  14. #> 2 2 F control 6.3
  15. #> 3 3 F control 9.5
  16. #> 4 4 M control 11.5
  17. #> 5 1 M cond1 12.3
  18. #> 6 2 F cond1 10.6
  19. #> 7 3 F cond1 13.1
  20. #> 8 4 M cond1 13.4
  21. #> 9 1 M cond2 10.7
  22. #> 10 2 F cond2 11.1
  23. #> 11 3 F cond2 13.8
  24. #> 12 4 M cond2 12.9

If you leave out the measure.vars, melt will automatically use all the other variables as the id.vars. The reverse is true if you leave out id.vars.

If you don’t specify variable.name, it will name that column "variable", and if you leave out value.name, it will name that column "measurement".

Optional: Rename the factor levels of the variable column.

  1. # Rename factor names from "cond1" and "cond2" to "first" and "second"
  2. levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
  3. levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
  4. # Sort by subject first, then by condition
  5. data_long <- data_long[ order(data_long$subject, data_long$condition), ]
  6. data_long
  7. #> subject sex condition measurement
  8. #> 1 1 M control 7.9
  9. #> 5 1 M first 12.3
  10. #> 9 1 M second 10.7
  11. #> 2 2 F control 6.3
  12. #> 6 2 F first 10.6
  13. #> 10 2 F second 11.1
  14. #> 3 3 F control 9.5
  15. #> 7 3 F first 13.1
  16. #> 11 3 F second 13.8
  17. #> 4 4 M control 11.5
  18. #> 8 4 M first 13.4
  19. #> 12 4 M second 12.9

From long to wide

The following code uses dcast to reshape the data. This function is meant for data frames; if you are working with arrays or matrices, use acast instead.

  1. olddata_long
  2. #> subject sex condition measurement
  3. #> 1 1 M control 7.9
  4. #> 2 1 M cond1 12.3
  5. #> 3 1 M cond2 10.7
  6. #> 4 2 F control 6.3
  7. #> 5 2 F cond1 10.6
  8. #> 6 2 F cond2 11.1
  9. #> 7 3 F control 9.5
  10. #> 8 3 F cond1 13.1
  11. #> 9 3 F cond2 13.8
  12. #> 10 4 M control 11.5
  13. #> 11 4 M cond1 13.4
  14. #> 12 4 M cond2 12.9
  15. # From the source:
  16. # "subject" and "sex" are columns we want to keep the same
  17. # "condition" is the column that contains the names of the new column to put things in
  18. # "measurement" holds the measurements
  19. library(reshape2)
  20. data_wide <- dcast(olddata_long, subject + sex ~ condition, value.var="measurement")
  21. data_wide
  22. #> subject sex cond1 cond2 control
  23. #> 1 1 M 12.3 10.7 7.9
  24. #> 2 2 F 10.6 11.1 6.3
  25. #> 3 3 F 13.1 13.8 9.5
  26. #> 4 4 M 13.4 12.9 11.5

Optional: A few things to make the data look nicer.

  1. # Rename cond1 to first, and cond2 to second
  2. names(data_wide)[names(data_wide)=="cond1"] <- "first"
  3. names(data_wide)[names(data_wide)=="cond2"] <- "second"
  4. # Reorder the columns
  5. data_wide <- data_wide[, c(1,2,5,3,4)]
  6. data_wide
  7. #> subject sex control first second
  8. #> 1 1 M 7.9 12.3 10.7
  9. #> 2 2 F 6.3 10.6 11.1
  10. #> 3 3 F 9.5 13.1 13.8
  11. #> 4 4 M 11.5 13.4 12.9

The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.