10. Categorical
As of version 0.15, pandas can include Categorical data in a DataFrame. For details, see the Categorical introduction and the API documentation.
In [127]: df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
1. Convert the raw grades to the Categorical data type:
In [128]: df["grade"] = df["raw_grade"].astype("category")
In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
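As a rough illustration of what the conversion does internally, the sketch below builds the same six raw grades as a standalone Series (the variable names are illustrative, not part of the tutorial) and inspects the categories and the integer codes that back each value.

```python
import pandas as pd

# Same raw grades as in the tutorial, held in a standalone Series.
raw = pd.Series(["a", "b", "b", "a", "a", "e"])
grade = raw.astype("category")

# The distinct values become the categories, sorted lexically by default.
categories = list(grade.cat.categories)

# Each element is stored as an integer code that indexes into the categories.
codes = grade.cat.codes.tolist()

print(categories)  # ['a', 'b', 'e']
print(codes)       # [0, 1, 1, 0, 0, 2]
```

Storing small integer codes instead of repeated strings is also why a Categorical column usually takes less memory when there are few distinct values.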
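2. Rename the categories to more meaningful names. Assigning to Series.cat.categories works in place on the pandas versions this tutorial targets; pandas 2.0 removed this setter in favor of Series.cat.rename_categories: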
In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]
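On pandas versions where direct assignment to `cat.categories` is no longer available, the same renaming can be sketched with `rename_categories`. The example below is self-contained and uses a dict mapping so the old-to-new pairing is explicit; the variable names are illustrative.

```python
import pandas as pd

grade = pd.Series(["a", "b", "b", "a", "a", "e"], dtype="category")

# rename_categories returns a new Series; the original is left unchanged.
renamed = grade.cat.rename_categories(
    {"a": "very good", "b": "good", "e": "very bad"}
)

print(renamed.tolist())
# ['very good', 'good', 'good', 'very good', 'very good', 'very bad']
```

In the tutorial's DataFrame this would be written as an assignment back to the column, for example `df["grade"] = df["grade"].cat.rename_categories(...)`.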
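3. Reorder the categories and add the missing ones at the same time (set_categories returns a new Series, which is why the result is assigned back to the column):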
In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
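One detail worth keeping in mind with `set_categories` is that it works in both directions: categories that are newly listed are added without any values, and values whose category is left out of the new list become missing. A minimal sketch, using a small standalone Series rather than the tutorial's DataFrame:

```python
import pandas as pd

grade = pd.Series(["very good", "good", "very bad"], dtype="category")

# Listing extra categories adds them and fixes the category order.
full = grade.cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)
print(full.cat.codes.tolist())  # [4, 3, 0]

# Omitting a category that is in use turns those values into NaN.
dropped = grade.cat.set_categories(["good", "very good"])
print(dropped.isna().tolist())  # [False, False, True]
```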
4. Sorting follows the order of the categories, not lexical order:
In [133]: df.sort_values(by="grade")
Out[133]:
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
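To make the difference from lexical sorting concrete, the sketch below sorts the same labels twice: once as plain strings and once as a Categorical column with an explicit category order. The small three-row DataFrame and the column name `label` are assumptions for illustration.

```python
import pandas as pd

order = ["very bad", "bad", "medium", "good", "very good"]
df = pd.DataFrame({"id": [1, 2, 3],
                   "label": ["very good", "good", "very bad"]})

# Same labels, stored as a Categorical with the desired category order.
df["grade"] = df["label"].astype(pd.CategoricalDtype(categories=order))

# Plain strings sort alphabetically: good < very bad < very good.
lexical = df.sort_values(by="label")["id"].tolist()

# Categorical values sort by position in the category list.
by_category = df.sort_values(by="grade")["id"].tolist()

print(lexical)      # [2, 3, 1]
print(by_category)  # [3, 2, 1]
```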
5. Grouping by a Categorical column also shows empty categories:
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
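Whether empty categories appear in a groupby result is controlled by the `observed` parameter of `groupby`. Its default has been changing across recent pandas releases, so passing it explicitly is the safer choice. The following is a minimal sketch on a small made-up DataFrame, not the tutorial's data:

```python
import pandas as pd

order = ["very bad", "bad", "medium", "good", "very good"]
grade = pd.Categorical(["very good", "good", "good", "very bad"],
                       categories=order)
df = pd.DataFrame({"id": [1, 2, 3, 4], "grade": grade})

# observed=False keeps every category, including those with no rows.
all_sizes = df.groupby("grade", observed=False).size()
print(all_sizes.tolist())  # [1, 0, 0, 2, 1]

# observed=True keeps only the categories that actually occur.
seen_sizes = df.groupby("grade", observed=True).size()
print(seen_sizes.to_dict())  # {'very bad': 1, 'good': 2, 'very good': 1}
```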