8. 用时间戳和另一列分组
# 读取employee数据集,用HIRE_DATE列创造一个DatetimeIndex
In[131]: employee = pd.read_csv('data/employee.csv',
parse_dates=['JOB_DATE', 'HIRE_DATE'],
index_col='HIRE_DATE')
employee.head()
Out[131]:
# 对性别做分组,查看二者的工资
In[132]: employee.groupby('GENDER')['BASE_SALARY'].mean().round(-2)
Out[132]: GENDER
Female 52200.0
Male 57400.0
Name: BASE_SALARY, dtype: float64
# 根据聘用日期,每10年分一组,查看工资情况
In[133]: employee.resample('10AS')['BASE_SALARY'].mean().round(-2)
Out[133]: HIRE_DATE
1958-01-01 81200.0
1968-01-01 106500.0
1978-01-01 69600.0
1988-01-01 62300.0
1998-01-01 58200.0
2008-01-01 47200.0
Freq: 10AS-JAN, Name: BASE_SALARY, dtype: float64
# 如果要按性别和五年分组,可以在groupby后面调用resample
In[134]: sal_avg = employee.groupby('GENDER').resample('10AS')['BASE_SALARY'].mean().round(-2)
sal_avg
Out[134]: GENDER HIRE_DATE
Female 1975-01-01 51600.0
1985-01-01 57600.0
1995-01-01 55500.0
2005-01-01 51700.0
2015-01-01 38600.0
Male 1958-01-01 81200.0
1968-01-01 106500.0
1978-01-01 72300.0
1988-01-01 64600.0
1998-01-01 59700.0
2008-01-01 47200.0
Name: BASE_SALARY, dtype: float64
# 对性别unstack
In[135]: sal_avg.unstack('GENDER')
Out[135]:
# 上面数据的问题,是分组不恰当造成的。
# 第一名男性受聘于1958年
In[136]: employee[employee['GENDER'] == 'Male'].index.min()
Out[136]: Timestamp('1958-12-29 00:00:00')
# 第一名女性受聘于1975年
In[137]: employee[employee['GENDER'] == 'Female'].index.min()
Out[137]: Timestamp('1975-06-09 00:00:00')
# 为了解决前面的分组问题,必须将日期和性别同时分组
In[138]: sal_avg2 = employee.groupby(['GENDER', pd.Grouper(freq='10AS')])['BASE_SALARY'].mean().round(-2)
sal_avg2
Out[138]: GENDER HIRE_DATE
Female 1968-01-01 NaN
1978-01-01 57100.0
1988-01-01 57100.0
1998-01-01 54700.0
2008-01-01 47300.0
Male 1958-01-01 81200.0
1968-01-01 106500.0
1978-01-01 72300.0
1988-01-01 64600.0
1998-01-01 59700.0
2008-01-01 47200.0
Name: BASE_SALARY, dtype: float64
# 再对性别做unstack
In[139]: sal_final = sal_avg2.unstack('GENDER')
sal_final
Out[139]:
原理
# groupby返回对象包含resample方法,但相反却不成立
In[140]: 'resample' in dir(employee.groupby('GENDER'))
Out[140]: True
In[141]: 'groupby' in dir(employee.resample('10AS'))
Out[141]: False
更多
# 通过加9,手工创造时间区间
In[142]: years = sal_final.index.year
years_right = years + 9
sal_final.index = years.astype(str) + '-' + years_right.astype(str)
sal_final
Out[142]:
# 也可以使用cut函数创造基于每名员工受聘年份的等宽间隔
In[143]: cuts = pd.cut(employee.index.year, bins=5, precision=0)
cuts.categories.values
Out[143]: array([Interval(1958.0, 1970.0, closed='right'),
Interval(1970.0, 1981.0, closed='right'),
Interval(1981.0, 1993.0, closed='right'),
Interval(1993.0, 2004.0, closed='right'),
Interval(2004.0, 2016.0, closed='right')], dtype=object)
In[144]: employee.groupby([cuts, 'GENDER'])['BASE_SALARY'].mean().unstack('GENDER').round(-2)
Out[144]: