3.1.1.2 panda data-frame
我们将会在来自pandas模块的pandas.DataFrame中存储和操作这个数据。它是电子表格程序在Python中的一个等价物。它与2D numpy
数据的区别在于列带有名字,可以在列中存储混合的数据类型,并且有精妙的选择和透视表机制。
3.1.1.2.1 创建dataframes: 读取数据文件或转化数组
从CSV文件读取: 使用上面的CSV文件,给出了大脑大小重量和IQ (Willerman et al. 1991) 的观察值 , 数据混合了数量值和类型值:
In [3]:
import pandas
data = pandas.read_csv('examples/brain_size.csv', sep=';', na_values=".")
data
Out[3]:
Unnamed: 0 | Gender | FSIQ | VIQ | PIQ | Weight | Height | MRI_Count | |
---|---|---|---|---|---|---|---|---|
0 | 1 | Female | 133 | 132 | 124 | 118 | 64.5 | 816932 |
1 | 2 | Male | 140 | 150 | 124 | NaN | 72.5 | 1001121 |
2 | 3 | Male | 139 | 123 | 150 | 143 | 73.3 | 1038437 |
3 | 4 | Male | 133 | 129 | 128 | 172 | 68.8 | 965353 |
4 | 5 | Female | 137 | 132 | 134 | 147 | 65.0 | 951545 |
5 | 6 | Female | 99 | 90 | 110 | 146 | 69.0 | 928799 |
6 | 7 | Female | 138 | 136 | 131 | 138 | 64.5 | 991305 |
7 | 8 | Female | 92 | 90 | 98 | 175 | 66.0 | 854258 |
8 | 9 | Male | 89 | 93 | 84 | 134 | 66.3 | 904858 |
9 | 10 | Male | 133 | 114 | 147 | 172 | 68.8 | 955466 |
10 | 11 | Female | 132 | 129 | 124 | 118 | 64.5 | 833868 |
11 | 12 | Male | 141 | 150 | 128 | 151 | 70.0 | 1079549 |
12 | 13 | Male | 135 | 129 | 124 | 155 | 69.0 | 924059 |
13 | 14 | Female | 140 | 120 | 147 | 155 | 70.5 | 856472 |
14 | 15 | Female | 96 | 100 | 90 | 146 | 66.0 | 878897 |
15 | 16 | Female | 83 | 71 | 96 | 135 | 68.0 | 865363 |
16 | 17 | Female | 132 | 132 | 120 | 127 | 68.5 | 852244 |
17 | 18 | Male | 100 | 96 | 102 | 178 | 73.5 | 945088 |
18 | 19 | Female | 101 | 112 | 84 | 136 | 66.3 | 808020 |
19 | 20 | Male | 80 | 77 | 86 | 180 | 70.0 | 889083 |
20 | 21 | Male | 83 | 83 | 86 | NaN | NaN | 892420 |
21 | 22 | Male | 97 | 107 | 84 | 186 | 76.5 | 905940 |
22 | 23 | Female | 135 | 129 | 134 | 122 | 62.0 | 790619 |
23 | 24 | Male | 139 | 145 | 128 | 132 | 68.0 | 955003 |
24 | 25 | Female | 91 | 86 | 102 | 114 | 63.0 | 831772 |
25 | 26 | Male | 141 | 145 | 131 | 171 | 72.0 | 935494 |
26 | 27 | Female | 85 | 90 | 84 | 140 | 68.0 | 798612 |
27 | 28 | Male | 103 | 96 | 110 | 187 | 77.0 | 1062462 |
28 | 29 | Female | 77 | 83 | 72 | 106 | 63.0 | 793549 |
29 | 30 | Female | 130 | 126 | 124 | 159 | 66.5 | 866662 |
30 | 31 | Female | 133 | 126 | 132 | 127 | 62.5 | 857782 |
31 | 32 | Male | 144 | 145 | 137 | 191 | 67.0 | 949589 |
32 | 33 | Male | 103 | 96 | 110 | 192 | 75.5 | 997925 |
33 | 34 | Male | 90 | 96 | 86 | 181 | 69.0 | 879987 |
34 | 35 | Female | 83 | 90 | 81 | 143 | 66.5 | 834344 |
35 | 36 | Female | 133 | 129 | 128 | 153 | 66.5 | 948066 |
36 | 37 | Male | 140 | 150 | 124 | 144 | 70.5 | 949395 |
37 | 38 | Female | 88 | 86 | 94 | 139 | 64.5 | 893983 |
38 | 39 | Male | 81 | 90 | 74 | 148 | 74.0 | 930016 |
39 | 40 | Male | 89 | 91 | 89 | 179 | 75.5 | 935863 |
分割符 它是CSV文件,但是分割符是”;”
缺失值 CSV中的第二个个体的weight是缺失的。如果我们没有指定缺失值 (NA = not available) 标记符, 我们将无法进行统计分析。
从数组中创建: pandas.DataFrame 也可以视为1D序列, 例如数组或列表的字典,如果我们有3个numpy
数组:
In [4]:
import numpy as np
t = np.linspace(-6, 6, 20)
sin_t = np.sin(t)
cos_t = np.cos(t)
我们可以将他们暴露为pandas.DataFrame:
In [5]:
pandas.DataFrame({'t': t, 'sin': sin_t, 'cos': cos_t})
Out[5]:
cos | sin | t | |
---|---|---|---|
0 | 0.960170 | 0.279415 | -6.000000 |
1 | 0.609977 | 0.792419 | -5.368421 |
2 | 0.024451 | 0.999701 | -4.736842 |
3 | -0.570509 | 0.821291 | -4.105263 |
4 | -0.945363 | 0.326021 | -3.473684 |
5 | -0.955488 | -0.295030 | -2.842105 |
6 | -0.596979 | -0.802257 | -2.210526 |
7 | -0.008151 | -0.999967 | -1.578947 |
8 | 0.583822 | -0.811882 | -0.947368 |
9 | 0.950551 | -0.310567 | -0.315789 |
10 | 0.950551 | 0.310567 | 0.315789 |
11 | 0.583822 | 0.811882 | 0.947368 |
12 | -0.008151 | 0.999967 | 1.578947 |
13 | -0.596979 | 0.802257 | 2.210526 |
14 | -0.955488 | 0.295030 | 2.842105 |
15 | -0.945363 | -0.326021 | 3.473684 |
16 | -0.570509 | -0.821291 | 4.105263 |
17 | 0.024451 | -0.999701 | 4.736842 |
18 | 0.609977 | -0.792419 | 5.368421 |
19 | 0.960170 | -0.279415 | 6.000000 |
其他输入: pandas 可以从SQL、excel文件或者其他格式输入数。见pandas文档。
3.1.1.2.2 操作数据
data
是pandas.DataFrame, 与R的dataframe类似:
In [6]:
data.shape # 40行8列
Out[6]:
(40, 8)
In [7]:
data.columns # 有列
Out[7]:
Index([u'Unnamed: 0', u'Gender', u'FSIQ', u'VIQ', u'PIQ', u'Weight', u'Height',
u'MRI_Count'],
dtype='object')
In [8]:
print(data['Gender']) # 列可以用名字访问
0 Female
1 Male
2 Male
3 Male
4 Female
5 Female
6 Female
7 Female
8 Male
9 Male
10 Female
11 Male
12 Male
13 Female
14 Female
15 Female
16 Female
17 Male
18 Female
19 Male
20 Male
21 Male
22 Female
23 Male
24 Female
25 Male
26 Female
27 Male
28 Female
29 Female
30 Female
31 Male
32 Male
33 Male
34 Female
35 Female
36 Male
37 Female
38 Male
39 Male
Name: Gender, dtype: object
In [9]:
# 简单选择器
data[data['Gender'] == 'Female']['VIQ'].mean()
Out[9]:
109.45
注意: 对于一个大dataframe的快速预览,用它的
describe
方法: pandas.DataFrame.describe()。
groupby: 根据类别变量的值拆分dataframe:
In [10]:
groupby_gender = data.groupby('Gender')
for gender, value in groupby_gender['VIQ']:
print((gender, value.mean()))
('Female', 109.45)
('Male', 115.25)
groupby_gender是一个强力的对象,暴露了结果dataframes组的许多操作:
In [11]:
groupby_gender.mean()
Out[11]:
Unnamed: 0 | FSIQ | VIQ | PIQ | Weight | Height | MRI_Count | |
---|---|---|---|---|---|---|---|
Gender | |||||||
Female | 19.65 | 111.9 | 109.45 | 110.45 | 137.200000 | 65.765000 | 862654.6 |
Male | 21.35 | 115.0 | 115.25 | 111.60 | 166.444444 | 71.431579 | 954855.4 |
在groupby_gender
上使用tab-完成来查找更多。其他的常见分组函数是median, count (对于检查不同子集的缺失值数量很有用) 或sum。Groupby评估是懒惰模式,因为在应用聚合函数之前不会进行什么工作。
练习
- 完整人口VIO的平均值是多少?
- 这项研究中包含了多少男性 / 女性?
- 提示 使用‘tab完成’来寻找可以调用的方法, 替换在上面例子中的‘mean’。
- 对于男性和女性来说,以log为单位显示的MRI count平均值是多少?
注意: 上面的绘图中使用了
groupby_gender.boxplot
(见这个例子)。
3.1.1.2.3 绘制数据
Pandas提供一些绘图工具 (pandas.tools.plotting
, 后面使用的是matplotlib) 来显示在dataframes数据的统计值:
散点图矩阵:
In [15]:
from pandas.tools import plotting
plotting.scatter_matrix(data[['Weight', 'Height', 'MRI_Count']])
Out[15]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x105c34810>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a0ade10>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a2d80d0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x10a33b210>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a3be450>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a40d9d0>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x10a49dc10>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a51f850>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10a5902d0>]], dtype=object)
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self._edgecolors == str('face'):
In [16]:
plotting.scatter_matrix(data[['PIQ', 'VIQ', 'FSIQ']])
Out[16]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x10a918b50>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10aa38710>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10ab29910>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x10ab8e790>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10ae207d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10abbd090>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x10af140d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10af89cd0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x10affa410>]], dtype=object)
两个总体
IQ指标是双峰的, 似乎有两个子总体。
练习
只绘制男性的散点图矩阵,然后是只有女性的。你是否认为2个子总体与性别相关?