Sorting
Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.
By Index
The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its index levels.
In [307]: df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
.....: 'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
.....: 'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
.....:
In [308]: unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'],
.....: columns=['three', 'two', 'one'])
.....:
In [309]: unsorted_df
Out[309]:
three two one
a NaN 0.708543 0.036274
d -0.540166 0.586626 NaN
c 0.410238 1.121731 1.044630
b -0.282532 -2.038777 -0.490032
# DataFrame
In [310]: unsorted_df.sort_index()
Out[310]:
three two one
a NaN 0.708543 0.036274
b -0.282532 -2.038777 -0.490032
c 0.410238 1.121731 1.044630
d -0.540166 0.586626 NaN
In [311]: unsorted_df.sort_index(ascending=False)
Out[311]:
three two one
d -0.540166 0.586626 NaN
c 0.410238 1.121731 1.044630
b -0.282532 -2.038777 -0.490032
a NaN 0.708543 0.036274
In [312]: unsorted_df.sort_index(axis=1)
Out[312]:
one three two
a 0.036274 NaN 0.708543
d NaN -0.540166 0.586626
c 1.044630 0.410238 1.121731
b -0.490032 -0.282532 -2.038777
# Series
In [313]: unsorted_df['three'].sort_index()
Out[313]:
a NaN
b -0.282532
c 0.410238
d -0.540166
Name: three, dtype: float64
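Because the frame above is built from random numbers, its values differ on every run. The following is a minimal sketch using a small hand-written frame (the name frame and its data are illustrative assumptions, not part of the original example) that shows the same three calls with a reproducible result:

```python
import pandas as pd

# Hypothetical data with shuffled row and column labels.
frame = pd.DataFrame(
    {"three": [3.0, 1.0, 2.0], "one": [30.0, 10.0, 20.0]},
    index=["c", "a", "b"],
)

by_rows = frame.sort_index()                      # rows ordered a, b, c
by_rows_desc = frame.sort_index(ascending=False)  # rows ordered c, b, a
by_cols = frame.sort_index(axis=1)                # columns ordered one, three

print(list(by_rows.index))
print(list(by_rows_desc.index))
print(list(by_cols.columns))
print(list(frame.index))  # the original frame keeps its order
```

By default sort_index() returns a new object, so frame itself is left unchanged.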
By Values
The Series.sort_values() method sorts a Series by its values. The DataFrame.sort_values() method sorts a DataFrame by its column or row values. The optional by parameter to DataFrame.sort_values() may be used to specify one or more columns that determine the sorted order.
In [314]: df1 = pd.DataFrame({'one':[2,1,1,1],'two':[1,3,2,4],'three':[5,4,3,2]})
In [315]: df1.sort_values(by='two')
Out[315]:
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2
The by parameter can take a list of column names, e.g.:
In [316]: df1[['one', 'two', 'three']].sort_values(by=['one','two'])
Out[316]:
one two three
2 1 2 3
1 1 3 4
3 1 4 2
0 2 1 5
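When several columns are given, the sort direction can also be set per column by passing a list to ascending. A short sketch, reusing the df1 data from above (the variable name mixed is only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]})

# Sort by 'one' ascending, then break ties by 'two' descending.
mixed = df1.sort_values(by=["one", "two"], ascending=[True, False])
print(mixed)
```

The ascending list must have one entry per name in by.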
These methods have special treatment of NA values via the na_position argument. In the example below, s is a Series of strings that already holds a missing value at label 5:
In [317]: s[2] = np.nan
In [318]: s.sort_values()
Out[318]:
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
2 NaN
5 NaN
dtype: object
In [319]: s.sort_values(na_position='first')
Out[319]:
2 NaN
5 NaN
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
dtype: object
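The Series s used above is not constructed in this section. A self-contained sketch of the same behaviour, with a hypothetical stand-in Series named words, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the string Series used above.
words = pd.Series(["B", np.nan, "A", "cat", np.nan])

na_last = words.sort_values()                      # default: na_position='last'
na_first = words.sort_values(na_position="first")  # missing values lead

print(na_last)
print(na_first)
```

Only the placement of the missing values changes. The non-missing values keep the same relative order in both results.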
By Indexes and Values
New in version 0.23.0.
Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level names.
# Build MultiIndex
In [320]: idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2),
.....: ('b', 2), ('b', 1), ('b', 1)])
.....:
In [321]: idx.names = ['first', 'second']
# Build DataFrame
In [322]: df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)},
.....: index=idx)
.....:
In [323]: df_multi
Out[323]:
              A
first second
a     1       6
      2       5
      2       4
b     2       3
      1       2
      1       1
Sort by ‘second’ (index) and ‘A’ (column)
In [324]: df_multi.sort_values(by=['second', 'A'])
Out[324]:
              A
first second
b     1       1
      1       2
a     1       6
b     2       3
a     2       4
      2       5
Note: If a string matches both a column name and an index level name then a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version.
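One way to read a mixed index-and-column sort is as a shorthand for moving the index levels into columns, sorting, and restoring the index. The sketch below compares the two spellings on the same data as above. The names direct and roundtrip are illustrative, and the equivalence is shown for this example only, not claimed as how pandas implements it:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)],
    names=["first", "second"],
)
df_multi = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)

# 'second' is an index level name, 'A' is a column name.
direct = df_multi.sort_values(by=["second", "A"])

# Longer spelling: levels become columns, sort, then rebuild the index.
roundtrip = (
    df_multi.reset_index()
    .sort_values(by=["second", "A"])
    .set_index(["first", "second"])
)

print(direct["A"].tolist())
print(roundtrip["A"].tolist())
```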
searchsorted
Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted().
In [325]: ser = pd.Series([1, 2, 3])
In [326]: ser.searchsorted([0, 3])
Out[326]: array([0, 2])
In [327]: ser.searchsorted([0, 4])
Out[327]: array([0, 3])
In [328]: ser.searchsorted([1, 3], side='right')
Out[328]: array([1, 3])
In [329]: ser.searchsorted([1, 3], side='left')
Out[329]: array([0, 2])
In [330]: ser = pd.Series([3, 1, 2])
In [331]: ser.searchsorted([0, 3], sorter=np.argsort(ser))
Out[331]: array([0, 2])
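As with the NumPy method, the results are insertion positions that keep the data ordered, so the Series is expected to be sorted already or a sorter must be supplied. A small sketch with made-up values (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

sorted_ser = pd.Series([10, 20, 30])

# Positions at which new values could be inserted while keeping the order.
left = sorted_ser.searchsorted([20, 25])                 # ties go before equal values
right = sorted_ser.searchsorted([20, 25], side="right")  # ties go after equal values

# For unsorted data, pass the permutation that would sort it.
unsorted_ser = pd.Series([30, 10, 20])
via_sorter = unsorted_ser.searchsorted(
    [20, 25], sorter=np.argsort(unsorted_ser.values)
)

print(left, right, via_sorter)
```

With sorter, the returned positions refer to the sorted order, not to positions in the unsorted Series.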
smallest / largest values
Series has the nsmallest() and nlargest() methods which return the smallest or largest n values. For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
In [332]: s = pd.Series(np.random.permutation(10))
In [333]: s
Out[333]:
0 8
1 2
2 9
3 5
4 6
5 0
6 1
7 7
8 4
9 3
dtype: int64
In [334]: s.sort_values()
Out[334]:
5 0
6 1
1 2
9 3
8 4
3 5
4 6
7 7
0 8
2 9
dtype: int64
In [335]: s.nsmallest(3)
Out[335]:
5 0
6 1
1 2
dtype: int64
In [336]: s.nlargest(3)
Out[336]:
2 9
0 8
7 7
dtype: int64
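To make the relationship to a full sort concrete, the following sketch fixes the values shown above instead of drawing a random permutation, and compares nsmallest() and nlargest() with sorting followed by head(n). The values are unique, so tie-breaking does not come into play:

```python
import pandas as pd

s = pd.Series([8, 2, 9, 5, 6, 0, 1, 7, 4, 3])

smallest = s.nsmallest(3)
largest = s.nlargest(3)

# Full sort followed by head(n), for comparison.
smallest_sorted = s.sort_values().head(3)
largest_sorted = s.sort_values(ascending=False).head(3)

print(smallest.tolist(), smallest_sorted.tolist())
print(largest.tolist(), largest_sorted.tolist())
```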
DataFrame also has the nlargest and nsmallest methods.
In [337]: df = pd.DataFrame({'a': [-2, -1, 1, 10, 8, 11, -1],
.....: 'b': list('abdceff'),
.....: 'c': [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})
.....:
In [338]: df.nlargest(3, 'a')
Out[338]:
a b c
5 11 f 3.0
3 10 c 3.2
4 8 e NaN
In [339]: df.nlargest(5, ['a', 'c'])
Out[339]:
a b c
6 -1 f 4.0
5 11 f 3.0
3 10 c 3.2
4 8 e NaN
2 1 d 4.0
In [340]: df.nsmallest(3, 'a')
Out[340]:
a b c
0 -2 a 1.0
1 -1 b 2.0
6 -1 f 4.0
In [341]: df.nsmallest(5, ['a', 'c'])
Out[341]:
a b c
0 -2 a 1.0
2 1 d 4.0
4 8 e NaN
1 -1 b 2.0
6 -1 f 4.0
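For a single column, the DataFrame methods select whole rows ranked by that column. The sketch below, using the same data as above, compares nlargest() with a full sort followed by head(n). It is limited to a case without ties among the selected values, because the handling of ties and of multiple columns can differ between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [-2, -1, 1, 10, 8, 11, -1],
                   "b": list("abdceff"),
                   "c": [1.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})

top3 = df.nlargest(3, "a")
# Full sort on the same column, then keep the first three rows.
top3_sorted = df.sort_values(by="a", ascending=False).head(3)

print(list(top3.index))
print(list(top3_sorted.index))
```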
Sorting by a multi-index column
When the columns are a MultiIndex, you must be explicit about which column to sort by: pass by a tuple that specifies every level of the column key.
In [342]: df1.columns = pd.MultiIndex.from_tuples([('a','one'),('a','two'),('b','three')])
In [343]: df1.sort_values(by=('a','two'))
Out[343]:
    a         b
  one two three
0   2   1     5
2   1   2     3
1   1   3     4
3   1   4     2
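The same idea extends to several sort keys, each given as a full tuple. A self-contained sketch that rebuilds df1 from above (the names by_a_two and by_two_keys are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]})
df1.columns = pd.MultiIndex.from_tuples([("a", "one"), ("a", "two"), ("b", "three")])

# The key is the full tuple, one entry per column level.
by_a_two = df1.sort_values(by=("a", "two"))

# Several keys are passed as a list of tuples.
by_two_keys = df1.sort_values(by=[("a", "one"), ("b", "three")])

print(list(by_a_two.index))
print(list(by_two_keys.index))
```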