4.2 Index and Summarize
Let’s go back to the example grades_2020() data defined before:
grades_2020()
name | grade_2020 |
---|---|
Sally | 1.0 |
Bob | 5.0 |
Alice | 8.5 |
Hank | 4.0 |
To retrieve a vector for name, we can access the DataFrame with the . (dot) syntax, as we did previously with structs in Section 3:
function names_grades1()
    df = grades_2020()
    df.name
end
JDS.names_grades1()
["Sally", "Bob", "Alice", "Hank"]
Or we can index a DataFrame much like an Array, using symbols and special characters. The first index selects rows and the second index selects columns:
function names_grades2()
    df = grades_2020()
    df[!, :name]
end
JDS.names_grades2()
["Sally", "Bob", "Alice", "Hank"]
Note that df.name is exactly the same as df[!, :name], which you can verify yourself by doing:
julia> df = DataFrame(id=[1]);
julia> @edit df.name
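In both cases, you get the column :name itself, without a copy being made. There is also df[:, :name], which returns a copy of the column :name. In most cases, df[!, :name] is the best bet: it avoids the copy, and because it returns the vector stored in the DataFrame, any change made to that vector modifies the DataFrame in place.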
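The difference between the two forms can be checked directly. The following is a minimal sketch, assuming the DataFrames package is installed; the table is rebuilt inline as a stand-in for grades_2020(), and the variable names stored and copied are only illustrative.

```julia
using DataFrames

# Stand-in for grades_2020(), rebuilt inline so the snippet is self-contained.
df = DataFrame(name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1.0, 5.0, 8.5, 4.0])

stored = df[!, :name]  # the vector held by the DataFrame, no copy
copied = df[:, :name]  # a fresh copy of that vector

stored === df.name     # true: the very same object as df.name
copied === df.name     # false: a different object ...
copied == df.name      # true: ... with equal contents (so far)

copied[1] = "Sarah"    # only the copy changes
stored[2] = "Robert"   # the DataFrame changes too
df.name                # ["Sally", "Robert", "Alice", "Hank"]
```

Use the copying form when you want to change the result without touching the original table.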
For any row, say the second row, we can use the first index as row indexing:
df = grades_2020()
df[2, :]
name | grade_2020 |
---|---|
Bob | 5.0 |
or create a function to give us any row i we want:
function grade_2020(i::Int)
    df = grades_2020()
    df[i, :]
end
JDS.grade_2020(2)
name | grade_2020 |
---|---|
Bob | 5.0 |
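Indexing a single row with df[i, :] returns a DataFrameRow rather than a plain vector, and its values can be read by column name. A small sketch, assuming the DataFrames package is installed and using an inline stand-in for grades_2020():

```julia
using DataFrames

# Stand-in for grades_2020(), rebuilt inline so the snippet is self-contained.
df = DataFrame(name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1.0, 5.0, 8.5, 4.0])

row = df[2, :]          # the second row
row isa DataFrameRow    # true
row.name                # "Bob"
row[:grade_2020]        # 5.0
```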
We can also get only names for the first 2 rows using slicing (again similar to an Array):
grades_indexing(df) = df[1:2, :name]
JDS.grades_indexing(grades_2020())
["Sally", "Bob"]
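Other Array-style row selectors work in the first index as well. The sketch below assumes the DataFrames package is installed and uses an inline stand-in for grades_2020(); the variable names are only illustrative.

```julia
using DataFrames

# Stand-in for grades_2020(), rebuilt inline so the snippet is self-contained.
df = DataFrame(name=["Sally", "Bob", "Alice", "Hank"],
               grade_2020=[1.0, 5.0, 8.5, 4.0])

first_two = df[1:2, :name]          # range of rows: ["Sally", "Bob"]
picked    = df[[1, 3], :name]       # explicit row numbers: ["Sally", "Alice"]
last_name = df[end, :name]          # `end` is the last row: "Hank"
tail      = df[2:end, :grade_2020]  # [5.0, 8.5, 4.0]
```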
If we assume that all names in the table are unique, we can also write a function to obtain the grade for a person via their name. To do so, we convert the table back to one of Julia’s basic data structures (see Section 3.3) that can represent mappings, namely a Dict:
function grade_2020(name::String)
    df = grades_2020()
    dic = Dict(zip(df.name, df.grade_2020))
    dic[name]
end
grade_2020("Bob")
5.0
which works because zip loops through df.name and df.grade_2020 at the same time like a “zipper”:
df = grades_2020()
collect(zip(df.name, df.grade_2020))
("Sally", 1.0)
("Bob", 5.0)
("Alice", 8.5)
("Hank", 4.0)
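The same lookup can be sketched with plain vectors, which makes the roles of zip and Dict easier to see in isolation. One detail worth knowing: indexing a Dict with a key it does not contain throws a KeyError, whereas get returns a fallback value. The helper lookup_grade and the choice of missing as the fallback are assumptions for illustration, not part of the original functions.

```julia
# Plain vectors standing in for df.name and df.grade_2020.
students = ["Sally", "Bob", "Alice", "Hank"]
grades = [1.0, 5.0, 8.5, 4.0]

# zip pairs the elements up; Dict turns the pairs into a name => grade mapping.
dic = Dict(zip(students, grades))

# Hypothetical helper: return `missing` instead of throwing for unknown names.
lookup_grade(name::String) = get(dic, name, missing)

lookup_grade("Bob")     # 5.0
lookup_grade("Nobody")  # missing
```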
However, converting a DataFrame to a Dict is only useful when the keys (here, the names) are unique. Generally that is not the case, and that’s why we need to learn how to filter a DataFrame.
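To see why uniqueness matters, consider what happens when a name appears twice. This is a minimal sketch with made-up data (the second "Bob" and his grade are hypothetical): the Dict keeps a single entry per key, so the later value silently replaces the earlier one.

```julia
# Made-up data in which "Bob" appears twice.
students = ["Sally", "Bob", "Bob"]
grades = [1.0, 5.0, 9.0]

dic = Dict(zip(students, grades))

length(dic)          # 2, not 3: one of the "Bob" rows is gone
dic["Bob"]           # 9.0: the later entry overwrote the earlier one
allunique(students)  # false: a quick check before building the Dict
```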
CC BY-NC-SA 4.0 Jose Storopoli, Rik Huijzer, Lazaro Alonso