Grouping
The grouping docs.
Unlike agg, apply’s callable is passed a sub-DataFrame, which gives you access to all the columns.
In [83]: df = pd.DataFrame({'animal': 'cat dog cat fish dog cat cat'.split(),
   ....:                    'size': list('SSMMMLL'),
   ....:                    'weight': [8, 10, 11, 1, 20, 12, 12],
   ....:                    'adult': [False] * 5 + [True] * 2}); df
   ....:
Out[83]:
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
# List the size of the heaviest animal in each group.
In [84]: df.groupby('animal').apply(lambda subf: subf['size'][subf['weight'].idxmax()])
Out[84]:
animal
cat L
dog M
fish M
dtype: object
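For contrast, agg’s callable receives one column at a time as a Series, so the cross-column lookup above cannot be written with agg directly. A rough equivalent (a sketch, not from the docs) keeps the idxmax step outside the groupby:
# index of the heaviest row per animal, then a plain .loc lookup
idx = df.groupby('animal')['weight'].idxmax()
df.loc[idx].set_index('animal')['size']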
In [85]: gb = df.groupby(['animal'])
In [86]: gb.get_group('cat')
Out[86]:
animal size weight adult
0 cat S 8 False
2 cat M 11 False
5 cat L 12 True
6 cat L 12 True
Apply to different items in a group
In [87]: def GrowUp(x):
   ....:     avg_weight = sum(x[x['size'] == 'S'].weight * 1.5)
   ....:     avg_weight += sum(x[x['size'] == 'M'].weight * 1.25)
   ....:     avg_weight += sum(x[x['size'] == 'L'].weight)
   ....:     avg_weight /= len(x)
   ....:     return pd.Series(['L', avg_weight, True], index=['size', 'weight', 'adult'])
   ....:
In [88]: expected_df = gb.apply(GrowUp)
In [89]: expected_df
Out[89]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
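An equivalent vectorized route (my sketch, not in the original; factors is a hypothetical helper) maps the per-size growth factor onto the weight column and averages per animal, reproducing the weight column above:
factors = {'S': 1.5, 'M': 1.25, 'L': 1.0}  # hypothetical size-to-growth-factor mapping
(df['weight'] * df['size'].map(factors)).groupby(df['animal']).mean()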
In [90]: S = pd.Series([i / 100.0 for i in range(1,11)])
In [91]: def CumRet(x, y):
   ....:     return x * (1 + y)
   ....:
In [92]: def Red(x):
   ....:     return functools.reduce(CumRet, x, 1.0)  # requires `import functools`
   ....:
In [93]: S.expanding().apply(Red, raw=True)
Out[93]:
0 1.010000
1 1.030200
2 1.061106
3 1.103550
4 1.158728
5 1.228251
6 1.314229
7 1.419367
8 1.547110
9 1.701821
dtype: float64
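Red simply compounds (1 + y) over each prefix of the Series, so the same numbers fall out of a one-line cumulative product (an equivalent spelling, not shown in the original):
(1 + S).cumprod()  # matches the expanding-apply result above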
Replacing some values with mean of the rest of a group
In [94]: df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, -1, 1, 2]})
In [95]: gb = df.groupby('A')
In [96]: def replace(g):
   ....:     mask = g < 0
   ....:     g.loc[mask] = g[~mask].mean()
   ....:     return g
   ....:
In [97]: gb.transform(replace)
Out[97]:
B
0 1.0
1 1.0
2 1.0
3 2.0
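The same fill can be sketched without mutating the group inside the UDF, assuming the goal is exactly "replace negatives with the mean of the non-negative members":
means = gb['B'].transform(lambda s: s[s >= 0].mean())  # per-group mean of the rest
df['B'].mask(df['B'] < 0, means)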
Sort groups by aggregated data
In [98]: df = pd.DataFrame({'code': ['foo', 'bar', 'baz'] * 2,
   ....:                    'data': [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
   ....:                    'flag': [False, True] * 3})
   ....:
In [99]: code_groups = df.groupby('code')
In [100]: agg_n_sort_order = code_groups[['data']].transform(sum).sort_values(by='data')
In [101]: sorted_df = df.loc[agg_n_sort_order.index]
In [102]: sorted_df
Out[102]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
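The same ordering can be spelled with a throwaway helper column (total is a hypothetical name in this sketch); kind='mergesort' keeps ties in their original within-group order:
(df.assign(total=df.groupby('code')['data'].transform('sum'))
   .sort_values('total', kind='mergesort')
   .drop(columns='total'))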
Create multiple aggregated columns
In [103]: rng = pd.date_range(start="2014-10-07",periods=10,freq='2min')
In [104]: ts = pd.Series(data = list(range(10)), index = rng)
In [105]: def MyCust(x):
   .....:     if len(x) > 2:
   .....:         return x[1] * 1.234
   .....:     return pd.NaT
   .....:
In [106]: mhc = {'Mean' : np.mean, 'Max' : np.max, 'Custom' : MyCust}
In [107]: ts.resample("5min").apply(mhc)
Out[107]:
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Mean 2014-10-07 00:00:00 1
2014-10-07 00:05:00 3.5
2014-10-07 00:10:00 6
2014-10-07 00:15:00 8.5
dtype: object
In [108]: ts
Out[108]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
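If one column per aggregation is preferred over the stacked Series above, passing a list to agg returns a DataFrame instead (a minimal sketch using only the built-in reducers; the custom function is left out):
ts.resample("5min").agg(['mean', 'max'])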
Create a value counts column and reassign back to the DataFrame
In [109]: df = pd.DataFrame({'Color': 'Red Red Red Blue'.split(),
   .....:                    'Value': [100, 150, 50, 50]}); df
   .....:
Out[109]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [110]: df['Counts'] = df.groupby(['Color']).transform(len)
In [111]: df
Out[111]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
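A more explicit spelling (my preference, not from the docs) names the column and uses the 'size' reducer, which counts rows per group just like len:
df['Counts'] = df.groupby('Color')['Value'].transform('size')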
Shift groups of the values in a column based on the index
In [112]: df = pd.DataFrame(
   .....:     {u'line_race': [10, 10, 8, 10, 10, 8],
   .....:      u'beyer': [99, 102, 103, 103, 88, 100]},
   .....:     index=[u'Last Gunfighter', u'Last Gunfighter', u'Last Gunfighter',
   .....:            u'Paynter', u'Paynter', u'Paynter']); df
   .....:
Out[112]:
line_race beyer
Last Gunfighter 10 99
Last Gunfighter 10 102
Last Gunfighter 8 103
Paynter 10 103
Paynter 10 88
Paynter 8 100
In [113]: df['beyer_shifted'] = df.groupby(level=0)['beyer'].shift(1)
In [114]: df
Out[114]:
line_race beyer beyer_shifted
Last Gunfighter 10 99 NaN
Last Gunfighter 10 102 99.0
Last Gunfighter 8 103 102.0
Paynter 10 103 NaN
Paynter 10 88 103.0
Paynter 8 100 88.0
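For the lead rather than the lag, the same pattern takes a negative shift (a small sketch on the same frame):
df.groupby(level=0)['beyer'].shift(-1)  # next race's beyer within each horse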
Select row with maximum value from each group
In [115]: df = pd.DataFrame({'host': ['other', 'other', 'that', 'this', 'this'],
   .....:                    'service': ['mail', 'web', 'mail', 'mail', 'web'],
   .....:                    'no': [1, 2, 1, 2, 1]}).set_index(['host', 'service'])
   .....:
In [116]: mask = df.groupby(level=0).agg('idxmax')
In [117]: df_count = df.loc[mask['no']].reset_index()
In [118]: df_count
Out[118]:
host service no
0 other web 2
1 that mail 1
2 this mail 2
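An alternative sketch (assuming ties should resolve to the last row after sorting) sorts once and keeps each host's final, i.e. largest, row:
df.sort_values('no').groupby(level='host').tail(1)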
Grouping like Python’s itertools.groupby
In [119]: df = pd.DataFrame([0, 1, 0, 1, 1, 1, 0, 1, 1], columns=['A'])
In [120]: df.A.groupby((df.A != df.A.shift()).cumsum()).groups
Out[120]:
{1: Int64Index([0], dtype='int64'),
2: Int64Index([1], dtype='int64'),
3: Int64Index([2], dtype='int64'),
4: Int64Index([3, 4, 5], dtype='int64'),
5: Int64Index([6], dtype='int64'),
6: Int64Index([7, 8], dtype='int64')}
In [121]: df.A.groupby((df.A != df.A.shift()).cumsum()).cumsum()
Out[121]:
0 0
1 1
2 0
3 1
4 2
5 3
6 0
7 1
8 2
Name: A, dtype: int64
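The same grouper also gives a compact run-length encoding (my extension; value and length are hypothetical column names via named aggregation):
df.A.groupby((df.A != df.A.shift()).cumsum()).agg(value='first', length='size')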
Expanding Data
Rolling Computation window based on values instead of counts
Splitting
Create a list of DataFrames, split using a delineation based on logic included in the rows (here, each group ends at a row whose Case is 'B').
In [122]: df = pd.DataFrame(data={'Case': ['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'],
   .....:                         'Data': np.random.randn(9)})
   .....:
In [123]: dfs = list(zip(*df.groupby((1 * (df['Case'] == 'B')).cumsum()
   .....:                            .rolling(window=3, min_periods=1).median())))[-1]
   .....:
In [124]: dfs[0]
Out[124]:
Case Data
0 A 0.174068
1 A -0.439461
2 A -0.741343
3 B -0.079673
In [125]: dfs[1]
Out[125]:
Case Data
4 A -0.922875
5 A 0.303638
6 B -0.917368
In [126]: dfs[2]
Out[126]:
Case Data
7 A -1.624062
8 A -0.758514
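A shorter split key (behaviorally equivalent on this data, though not the docs' version) starts a new group on the row after each 'B':
keys = df['Case'].eq('B').shift(fill_value=False).cumsum()
dfs = [g for _, g in df.groupby(keys)]  # same three frames as above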
Pivot
The Pivot docs.
In [127]: df = pd.DataFrame(data={'Province': ['ON', 'QC', 'BC', 'AL', 'AL', 'MN', 'ON'],
   .....:                         'City': ['Toronto', 'Montreal', 'Vancouver', 'Calgary',
   .....:                                  'Edmonton', 'Winnipeg', 'Windsor'],
   .....:                         'Sales': [13, 6, 16, 8, 4, 3, 1]})
   .....:
In [128]: table = pd.pivot_table(df,values=['Sales'],index=['Province'],columns=['City'],aggfunc=np.sum,margins=True)
In [129]: table.stack('City')
Out[129]:
Sales
Province City
AL All 12.0
Calgary 8.0
Edmonton 4.0
BC All 16.0
Vancouver 16.0
MN All 3.0
Winnipeg 3.0
... ...
All Calgary 8.0
Edmonton 4.0
Montreal 6.0
Toronto 13.0
Vancouver 16.0
Windsor 1.0
Winnipeg 3.0
[20 rows x 1 columns]
Frequency table like plyr in R
In [130]: grades = [48,99,75,80,42,80,72,68,36,78]
In [131]: df = pd.DataFrame({'ID': ["x%d" % r for r in range(10)],
   .....:                    'Gender': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'M', 'M'],
   .....:                    'ExamYear': ['2007', '2007', '2007', '2008', '2008', '2008', '2008', '2009', '2009', '2009'],
   .....:                    'Class': ['algebra', 'stats', 'bio', 'algebra', 'algebra', 'stats', 'stats', 'algebra', 'bio', 'bio'],
   .....:                    'Participated': ['yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
   .....:                    'Passed': ['yes' if x > 50 else 'no' for x in grades],
   .....:                    'Employed': [True, True, True, False, False, False, False, True, True, False],
   .....:                    'Grade': grades})
   .....:
In [132]: df.groupby('ExamYear').agg({'Participated': lambda x: x.value_counts()['yes'],
   .....:                             'Passed': lambda x: sum(x == 'yes'),
   .....:                             'Employed': lambda x: sum(x),
   .....:                             'Grade': lambda x: sum(x) / len(x)})
   .....:
Out[132]:
Participated Passed Employed Grade
ExamYear
2007 3 2 3 74.000000
2008 3 3 0 68.500000
2009 3 2 2 60.666667
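For a plain frequency table rather than custom reducers, pd.crosstab is the direct analogue (my addition):
pd.crosstab(df['ExamYear'], df['Passed'])  # counts of yes/no per year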
Plot pandas DataFrame with year over year data
To create a year-and-month cross-tabulation:
In [133]: df = pd.DataFrame({'value': np.random.randn(36)},
   .....:                   index=pd.date_range('2011-01-01', freq='M', periods=36))
   .....:
In [134]: pd.pivot_table(df, index=df.index.month, columns=df.index.year,
   .....:                values='value', aggfunc='sum')
   .....:
Out[134]:
2011 2012 2013
1 -0.560859 0.120930 0.516870
2 -0.589005 -0.210518 0.343125
3 -1.070678 -0.931184 2.137827
4 -1.681101 0.240647 0.452429
5 0.403776 -0.027462 0.483103
6 0.609862 0.033113 0.061495
7 0.387936 -0.658418 0.240767
8 1.815066 0.324102 0.782413
9 0.705200 -1.403048 0.628462
10 -0.668049 -0.581967 -0.880627
11 0.242501 -1.233862 0.777575
12 0.313421 -3.520876 -0.779367
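The heading promises a plot; the cross-tabulation above plots directly, one line per year with month on the x-axis (a sketch; requires matplotlib, and table is a local name for the pivot result):
table = pd.pivot_table(df, index=df.index.month, columns=df.index.year,
                       values='value', aggfunc='sum')
table.plot()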
Apply
Rolling Apply to Organize - Turning embedded lists into a multi-index frame
In [135]: df = pd.DataFrame(data={'A': [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
   .....:                         'B': [['a', 'b', 'c'], ['jj', 'kk'], ['ccc']]},
   .....:                   index=['I', 'II', 'III'])
   .....:
In [136]: def SeriesFromSubList(aList):
   .....:     return pd.Series(aList)
   .....:
In [137]: df_orgz = pd.concat(dict([(ind, row.apply(SeriesFromSubList))
   .....:                           for ind, row in df.iterrows()]))
   .....:
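The result df_orgz is indexed by (row label, original column) pairs, with each embedded list spread across integer columns 0, 1, 2, … and padded with NaN up to the length of the longest list.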
Rolling Apply with a DataFrame returning a Series
Rolling Apply to multiple columns, where the function calculates a Series and then returns a scalar taken from that Series
In [138]: df = pd.DataFrame(data=np.random.randn(2000, 2) / 10000,
   .....:                   index=pd.date_range('2001-01-01', periods=2000),
   .....:                   columns=['A', 'B']); df
   .....:
Out[138]:
A B
2001-01-01 0.000032 -0.000004
2001-01-02 -0.000001 0.000207
2001-01-03 0.000120 -0.000220
2001-01-04 -0.000083 -0.000165
2001-01-05 -0.000047 0.000156
2001-01-06 0.000027 0.000104
2001-01-07 0.000041 -0.000101
... ... ...
2006-06-17 -0.000034 0.000034
2006-06-18 0.000002 0.000166
2006-06-19 0.000023 -0.000081
2006-06-20 -0.000061 0.000012
2006-06-21 -0.000111 0.000027
2006-06-22 -0.000061 -0.000009
2006-06-23 0.000074 -0.000138
[2000 rows x 2 columns]
In [139]: def gm(aDF, Const):
   .....:     v = ((((aDF.A + aDF.B) + 1).cumprod()) - 1) * Const
   .....:     return (aDF.index[0], v.iloc[-1])
   .....:
In [140]: S = pd.Series(dict([gm(df.iloc[i:min(i + 51, len(df) - 1)], 5)
   .....:                     for i in range(len(df) - 50)])); S
   .....:
Out[140]:
2001-01-01 -0.001373
2001-01-02 -0.001705
2001-01-03 -0.002885
2001-01-04 -0.002987
2001-01-05 -0.002384
2001-01-06 -0.004700
2001-01-07 -0.005500
...
2006-04-28 -0.002682
2006-04-29 -0.002436
2006-04-30 -0.002602
2006-05-01 -0.001785
2006-05-02 -0.001799
2006-05-03 -0.000605
2006-05-04 -0.000541
Length: 1950, dtype: float64
Rolling apply with a DataFrame returning a Scalar
Rolling Apply to multiple columns where function returns a Scalar (Volume Weighted Average Price)
In [141]: rng = pd.date_range(start='2014-01-01', periods=100)
In [142]: df = pd.DataFrame({'Open': np.random.randn(len(rng)),
   .....:                    'Close': np.random.randn(len(rng)),
   .....:                    'Volume': np.random.randint(100, 2000, len(rng))},
   .....:                   index=rng); df
   .....:
Out[142]:
Open Close Volume
2014-01-01 0.011174 -0.653039 1581
2014-01-02 0.214258 1.314205 1707
2014-01-03 -1.046922 -0.341915 1768
2014-01-04 -0.752902 -1.303586 836
2014-01-05 -0.410793 0.396288 694
2014-01-06 0.648401 -0.548006 796
2014-01-07 0.737320 0.481380 265
... ... ... ...
2014-04-04 0.120378 -2.548128 564
2014-04-05 0.231661 0.223346 1908
2014-04-06 0.952664 1.228841 1090
2014-04-07 -0.176090 0.552784 1813
2014-04-08 1.781318 -0.795389 1103
2014-04-09 -0.753493 -0.018815 1456
2014-04-10 -1.047997 1.138197 1193
[100 rows x 3 columns]
In [143]: def vwap(bars):
   .....:     return (bars.Close * bars.Volume).sum() / bars.Volume.sum()
   .....:
In [144]: window = 5
In [145]: s = pd.concat([pd.Series(vwap(df.iloc[i:i + window]),
   .....:                          index=[df.index[i + window]])
   .....:                for i in range(len(df) - window)])
   .....:
In [146]: s.round(2)
Out[146]:
2014-01-06 -0.03
2014-01-07 0.07
2014-01-08 -0.40
2014-01-09 -0.81
2014-01-10 -0.63
2014-01-11 -0.86
2014-01-12 -0.36
...
2014-04-04 -1.27
2014-04-05 -1.36
2014-04-06 -0.73
2014-04-07 0.04
2014-04-08 0.21
2014-04-09 0.07
2014-04-10 0.25
Length: 95, dtype: float64
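The loop recomputes each window slice; an equivalent vectorized vwap (my sketch) uses two rolling sums. The rolling result is labeled at the window's last bar, so a shift(1) reproduces the loop's convention of stamping the bar after the window:
vwap_vec = ((df['Close'] * df['Volume']).rolling(window).sum()
            / df['Volume'].rolling(window).sum())
vwap_vec.shift(1).dropna().round(2)  # lines up with s.round(2) above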