4.8 Groupby and Combine

In the R programming language, Wickham (2011) has popularized the so-called split-apply-combine strategy for data transformations. In essence, this strategy splits a dataset into distinct groups, applies one or more functions to each group, and then combines the result. DataFrames.jl fully supports split-apply-combine. We will use the student grades example like before. Suppose that we want to know each student’s mean grade:

  1. function all_grades()
  2. df1 = grades_2020()
  3. df1 = select(df1, :name, :grade_2020 => :grade)
  4. df2 = grades_2021()
  5. df2 = select(df2, :name, :grade_2021 => :grade)
  6. rename_bob2(data_col) = replace.(data_col, "Bob 2" => "Bob")
  7. df2 = transform(df2, :name => rename_bob2 => :name)
  8. return vcat(df1, df2)
  9. end
  10. all_grades()
namegrade
Sally1.0
Bob5.0
Alice8.5
Hank4.0
Bob9.5
Sally9.5
Hank6.0

The strategy is to split the dataset into distinct students, apply the mean function to each student, and combine the result.

The split is called groupby and we give as second argument the column ID that we want to split the dataset into:

  1. groupby(all_grades(), :name)
  1. GroupedDataFrame with 4 groups based on key: name
  2. Group 1 (2 rows): name = "Sally"
  3. Row name grade
  4. String Float64
  5. ─────┼─────────────────
  6. 1 Sally 1.0
  7. 2 Sally 9.5
  8. Group 2 (2 rows): name = "Bob"
  9. Row name grade
  10. String Float64
  11. ─────┼─────────────────
  12. 1 Bob 5.0
  13. 2 Bob 9.5
  14. Group 3 (1 row): name = "Alice"
  15. Row name grade
  16. String Float64
  17. ─────┼─────────────────
  18. 1 Alice 8.5
  19. Group 4 (2 rows): name = "Hank"
  20. Row name grade
  21. String Float64
  22. ─────┼─────────────────
  23. 1 Hank 4.0
  24. 2 Hank 6.0

We apply the mean function from Julia’s standard library Statistics module:

  1. using Statistics

To apply this function, use the combine function:

  1. gdf = groupby(all_grades(), :name)
  2. combine(gdf, :grade => mean)
namegrade_mean
Sally5.25
Bob7.25
Alice8.5
Hank5.0

Imagine having to do this without the groupby and combine functions. We would need to loop over our data to split it up into groups, then loop over each split to apply a function, and finally loop over each group to gather the final result. Therefore, the split-apply-combine technique is a great one to know.

4.8.1 Multiple Source Columns

But what if we want to apply a function to multiple columns of our dataset?

  1. group = [:A, :A, :B, :B]
  2. X = 1:4
  3. Y = 5:8
  4. df = DataFrame(; group, X, Y)
groupXY
A15
A26
B37
B48

This is accomplished in a similar manner:

  1. gdf = groupby(df, :group)
  2. combine(gdf, [:X, :Y] .=> mean; renamecols=false)
groupXY
A1.55.5
B3.57.5

Note that we’ve used the dot . operator before the right arrow => to indicate that the mean has to be applied to multiple source columns [:X, :Y].

To use composable functions, a simple way is to create a function that does the intended composable transformations. For instance, for a series of values, let’s first take the mean followed by round to a whole number (also known as an integer Int):

  1. gdf = groupby(df, :group)
  2. rounded_mean(data_col) = round(Int, mean(data_col))
  3. combine(gdf, [:X, :Y] .=> rounded_mean; renamecols=false)
groupXY
A26
B48

4.8 Groupby and Combine - 图1 Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso