索引类型

索引类型

We have discussed MultiIndex in the previous sections pretty extensively. DatetimeIndex and PeriodIndex are shown here, and information about TimedeltaIndex` is found here.

In the following sub-sections we will highlight some other index types.

ategoricalIndex

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [125]: from pandas.api.types import CategoricalDtype
In [126]: df = pd.DataFrame({'A': np.arange(6),
   .....:                    'B': list('aabbca')})
   .....: 
In [127]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
In [128]: df
Out[128]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a
In [129]: df.dtypes
Out[129]: 
A       int64
B    category
dtype: object
In [130]: df.B.cat.categories
Out[130]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [131]: df2 = df.set_index('B')
In [132]: df2.index
Out[132]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [133]: df2.loc['a']
Out[133]: 
   A
B   
a  0
a  1
a  5

The CategoricalIndex is preserved after indexing:

In [134]: df2.loc['a'].index
Out[134]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [135]: df2.sort_index()
Out[135]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well.

In [136]: df2.groupby(level=0).sum()
Out[136]: 
   A
B   
c  4
a  6
b  5
In [137]: df2.groupby(level=0).sum().index
Out[137]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [138]: df2.reindex(['a','e'])
Out[138]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN
In [139]: df2.reindex(['a','e']).index
Out[139]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
In [140]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde')))
Out[140]: 
     A
B     
a  0.0
a  1.0
a  5.0
e  NaN
In [141]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index
Out[141]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')

警告

Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.


In [9]: df3 = pd.DataFrame({'A' : np.arange(6),
                            'B' : pd.Series(list('aabbca')).astype('category')})
In [11]: df3 = df3.set_index('B')
In [11]: df3.index
Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')
In [12]: pd.concat([df2, df3]
TypeError: categories must match existing categories when appending

Int64Index and RangeIndex

警告

Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see here.

Int64Index is a fundamental basic index in pandas. This is an Immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.

Float64Index

By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [],ix,loc for scalar indexing and slicing work exactly the same.

In [142]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])
In [143]: indexf
Out[143]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
In [144]: sf = pd.Series(range(5), index=indexf)
In [145]: sf
Out[145]: 
1.5    0
2.0    1
3.0    2
4.5    3
5.0    4
dtype: int64

Scalar selection for [],.loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).

In [146]: sf[3]
Out[146]: 2
In [147]: sf[3.0]
Out[147]: 2
In [148]: sf.loc[3]
Out[148]: 2
In [149]: sf.loc[3.0]
Out[149]: 2

The only positional indexing is via iloc.

In [150]: sf.iloc[3]
Out[150]: 3

A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using [],ix,loc, and always positional when using iloc. The exception is when the slice is boolean, in which case it will always be positional.

In [151]: sf[2:4]
Out[151]: 
2.0    1
3.0    2
dtype: int64
In [152]: sf.loc[2:4]
Out[152]: 
2.0    1
3.0    2
dtype: int64
In [153]: sf.iloc[2:4]
Out[153]: 
3.0    2
4.5    3
dtype: int64

In float indexes, slicing using floats is allowed.

In [154]: sf[2.1:4.6]
Out[154]: 
3.0    2
4.5    3
dtype: int64
In [155]: sf.loc[2.1:4.6]
Out[155]: 
3.0    2
4.5    3
dtype: int64

In non-float indexes, slicing using floats will raise a TypeError.

In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

警告

Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError:


In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on < class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of < type 'float'>

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.

In [156]: dfir = pd.concat([pd.DataFrame(np.random.randn(5,2),
   .....:                                index=np.arange(5) * 250.0,
   .....:                                columns=list('AB')),
   .....:                   pd.DataFrame(np.random.randn(6,2),
   .....:                                index=np.arange(4,10) * 250.1,
   .....:                                columns=list('AB'))])
   .....: 
In [157]: dfir
Out[157]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219
1000.4  0.310610 -0.108002
1250.5 -0.974226 -1.147708
1500.6 -2.281374  0.760010
1750.7 -0.742532  1.533318
2000.8  2.495362 -0.432771
2250.9 -0.068954  0.043520

Selection operations then will always work on a value basis, for all selection operators.

In [158]: dfir[0:1000.4]
Out[158]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219
1000.4  0.310610 -0.108002
In [159]: dfir.loc[0:1001,'A']
Out[159]: 
0.0       0.997289
250.0    -0.179129
500.0     0.936914
750.0    -1.003401
1000.0   -0.724626
1000.4    0.310610
Name: A, dtype: float64
In [160]: dfir.loc[1000.4]
Out[160]: 
A    0.310610
B   -0.108002
Name: 1000.4, dtype: float64

You could retrieve the first 1 second (1000 ms) of data as such:

In [161]: dfir[0:1000]
Out[161]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219

If you need integer based selection, you should use iloc:

In [162]: dfir.iloc[0:5]
Out[162]: 
               A         B
0.0     0.997289 -1.693316
250.0  -0.179129 -1.598062
500.0   0.936914  0.912560
750.0  -1.003401  1.632781
1000.0 -0.724626  0.178219

IntervalIndex

New in version 0.20.0.

IntervalIndex together with its own dtype, interval as well as the Interval scalar type, allow first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

警告

These indexing behaviors are provisional and may change in a future version of pandas.

An IntervalIndex can be used in Series and in DataFrame as the index.

In [163]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
   .....:                    index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
   .....: 
In [164]: df
Out[164]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [165]: df.loc[2]
Out[165]: 
A    2
Name: (1, 2], dtype: int64
In [166]: df.loc[[2, 3]]
Out[166]: 
        A
(1, 2]  2
(2, 3]  3

If you select a label contained within an interval, this will also select the interval.

In [167]: df.loc[2.5]
Out[167]: 
A    3
Name: (2, 3], dtype: int64
In [168]: df.loc[[2.5, 3.5]]
Out[168]: 
        A
(2, 3]  3
(3, 4]  4

Interval and IntervalIndex are used by cut and qcut:

In [169]: c = pd.cut(range(4), bins=2)
In [170]: c
Out[170]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
In [171]: c.categories
Out[171]: 
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]]
              closed='right',
              dtype='interval[float64]')

Furthermore, IntervalIndex allows one to bin other data with these same bins, with NaN representing a missing value similar to other dtypes.

In [172]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[172]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]

Generating Ranges of Intervals

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is a 1 for numeric intervals, and calendar day for datetime-like intervals:

In [173]: pd.interval_range(start=0, end=5)
Out[173]: 
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]]
              closed='right',
              dtype='interval[int64]')
In [174]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
Out[174]: 
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]]
              closed='right',
              dtype='interval[datetime64[ns]]')
In [175]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
Out[175]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]]
              closed='right',
              dtype='interval[timedelta64[ns]]')

The freq parameter can used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [176]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[176]: 
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]]
              closed='right',
              dtype='interval[float64]')
In [177]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
Out[177]: 
IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]]
              closed='right',
              dtype='interval[datetime64[ns]]')
In [178]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
Out[178]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]]
              closed='right',
              dtype='interval[timedelta64[ns]]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [179]: pd.interval_range(start=0, end=4, closed='both')
Out[179]: 
IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]]
              closed='both',
              dtype='interval[int64]')
In [180]: pd.interval_range(start=0, end=4, closed='neither')
Out[180]: 
IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)]
              closed='neither',
              dtype='interval[int64]')

New in version 0.23.0.

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [181]: pd.interval_range(start=0, end=6, periods=4)
Out[181]: 
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]]
              closed='right',
              dtype='interval[float64]')
In [182]: pd.interval_range(pd.Timestamp('2018-01-01'), pd.Timestamp('2018-02-28'), periods=3)
Out[182]: 
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]]
              closed='right',
              dtype='interval[datetime64[ns]]')