4.6 Join

At the start of this chapter, we showed multiple tables and raised questions also related to multiple tables. However, we haven’t talked about combining tables yet, which we will do in this section. In DataFrames.jl, combining multiple tables is done via joins. Joins are extremely powerful, but it might take a while to wrap your head around them. It is not necessary to know the joins below by heart, because the DataFrames.jl documentation, along with this book, will list them for you. But, it’s essential to know that joins exist. If you ever find yourself looping over rows in a DataFrame and comparing it with other data, then you probably need one of the joins below.

In Section 4, we’ve introduced the grades for 2020 with grades_2020:

  1. grades_2020()
namegrade_2020
Sally1.0
Bob5.0
Alice8.5
Hank4.0

Now, we’re going to combine grades_2020 with grades from 2021:

  1. grades_2021()
namegrade_2021
Bob 29.5
Sally9.5
Hank6.0

To do this, we are going to use joins. DataFrames.jl lists no less than seven kinds of join. This might seem daunting at first, but hang on because they are all useful and we will showcase them all.

4.6.1 innerjoin

This first is innerjoin. Suppose that we have two datasets A and B with respectively columns A_1, A_2, ..., A_n and B_1, B_2, ..., B_m and one of the columns has the same name, say A_1 and B_1 are both called :id. Then, the inner join on :id will go through all the elements in A_1 and compare it to the elements in B_1. If the elements are the same, then it will add all the information from A_2, ..., A_n and B_2, ..., B_m after the :id column.

Okay, so no worries if you didn’t get this description. The result on the grades datasets looks like this:

  1. innerjoin(grades_2020(), grades_2021(); on=:name)
namegrade_2020grade_2021
Sally1.09.5
Hank4.06.0

Note that only “Sally” and “Hank” are in both datasets. The name inner join makes sense since, in mathematics, the set intersection is defined by “all elements in \(A\), that are also in \(B\), or all elements in \(B\) that are also in \(A\).”

4.6.2 outerjoin

Maybe you’re now thinking “aha, if we have an inner, then we probably also have an outer.” Yes, you’ve guessed right!

The outerjoin is much less strict than the innerjoin and just takes any row it can find which contains a name in at least one of the datasets:

  1. outerjoin(grades_2020(), grades_2021(); on=:name)
namegrade_2020grade_2021
Sally1.09.5
Hank4.06.0
Bob5.0missing
Alice8.5missing
Bob 2missing9.5

So, this method can create missing data even though none of the original datasets had missing values.

4.6.3 crossjoin

We can get even more missing data if we use the crossjoin. This gives the Cartesian product of the rows, which is basically multiplication of rows, that is, for every row create a combination with any other row:

  1. crossjoin(grades_2020(), grades_2021(); on=:id)
  1. MethodError: no method matching crossjoin(::DataFrame, ::DataFrame; on=:id)
  2. Closest candidates are:
  3. crossjoin(::DataFrames.AbstractDataFrame, ::DataFrames.AbstractDataFrame; makeunique) at ~/.julia/packages/DataFrames/dgZn3/src/join/composer.jl:1412 got unsupported keyword argument "on"
  4. crossjoin(::DataFrames.AbstractDataFrame, ::DataFrames.AbstractDataFrame, !Matched::DataFrames.AbstractDataFrame...; makeunique) at ~/.julia/packages/DataFrames/dgZn3/src/join/composer.jl:1434 got unsupported keyword argument "on"
  5. ...

Oops. Since crossjoin doesn’t take the elements in the row into account, we don’t need to specify the on argument for what we want to join:

  1. crossjoin(grades_2020(), grades_2021())
  1. ArgumentError: Duplicate variable names: :name. Pass makeunique=true to make them unique using a suffix automatically.
  2. Stacktrace:
  3. [1] add_names(ind::DataFrames.Index, add_ind::DataFrames.Index; makeunique::Bool)
  4. @ DataFrames ~/.julia/packages/DataFrames/dgZn3/src/other/index.jl:434
  5. [2] merge!(x::DataFrames.Index, y::DataFrames.Index; makeunique::Bool)
  6. ...

Oops again. This is a very common error with DataFrames and joins. The tables for the 2020 and 2021 grades have a duplicate column name, namely :name. Like before, the error that DataFrames.jl outputs shows a simple suggestion that might fix the issue. We can just pass makeunique=true to solve this:

  1. crossjoin(grades_2020(), grades_2021(); makeunique=true)
namegrade_2020name_1grade_2021
Sally1.0Bob 29.5
Sally1.0Sally9.5
Sally1.0Hank6.0
Bob5.0Bob 29.5
Bob5.0Sally9.5
Bob5.0Hank6.0
Alice8.5Bob 29.5
Alice8.5Sally9.5
Alice8.5Hank6.0
Hank4.0Bob 29.5
Hank4.0Sally9.5
Hank4.0Hank6.0

So, now, we have one row for each grade from everyone in grades 2020 and grades 2021 datasets. For direct queries, such as “who has the highest grade?” the Cartesian product is usually not so useful, but for “statistical” queries, it can be.

4.6.4 leftjoin and rightjoin

More useful for scientific data projects are the leftjoin and rightjoin. The left join gives all the elements in the left DataFrame:

  1. leftjoin(grades_2020(), grades_2021(); on=:name)
namegrade_2020grade_2021
Sally1.09.5
Hank4.06.0
Bob5.0missing
Alice8.5missing

Here, grades for “Bob” and “Alice” were missing in the grades 2021 table, so that’s why there are also missing elements. The right join does sort of the opposite:

  1. rightjoin(grades_2020(), grades_2021(); on=:name)
namegrade_2020grade_2021
Sally1.09.5
Hank4.06.0
Bob 2missing9.5

Now, grades in 2020 are missing.

Note that leftjoin(A, B) != rightjoin(B, A), because the order of the columns will differ. For example, compare the output below to the previous output:

  1. leftjoin(grades_2021(), grades_2020(); on=:name)
namegrade_2021grade_2020
Sally9.51.0
Hank6.04.0
Bob 29.5missing

4.6.5 semijoin and antijoin

Lastly, we have the semijoin and antijoin.

The semi join is even more restrictive than the inner join. It returns only the elements from the left DataFrame which are in both DataFrames. This is like a combination of the left join with the inner join.

  1. semijoin(grades_2020(), grades_2021(); on=:name)
namegrade_2020
Sally1.0
Hank4.0

The opposite of the semi join is the anti join. It returns only the elements from the left DataFrame which are not in the right DataFrame:

  1. antijoin(grades_2020(), grades_2021(); on=:name)
namegrade_2020
Bob5.0
Alice8.5

4.6 Join - 图1 Support this project
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso