MultiIndex / advanced indexing
This section covers indexing with a MultiIndex and other advanced indexing features. See Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
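As a rough illustration (a sketch, not taken from the original text; the column labels here are made up), the chained form goes through two indexing calls and may assign into a temporary copy, whereas a single .loc call assigns in one step:
- dfmi = pd.DataFrame(np.zeros((2, 2)),
-                     columns=pd.MultiIndex.from_product([['one'], ['first', 'second']]))
- dfmi['one']['second'] = 1                # chained assignment: may modify a temporary copy
- dfmi.loc[:, ('one', 'second')] = 1       # preferred: a single .loc call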
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and MultiIndex.set_labels to MultiIndex.set_codes.
Creating a MultiIndex (hierarchical index) object
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
- In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
- ...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
- ...:
- In [2]: tuples = list(zip(*arrays))
- In [3]: tuples
- Out[3]:
- [('bar', 'one'),
- ('bar', 'two'),
- ('baz', 'one'),
- ('baz', 'two'),
- ('foo', 'one'),
- ('foo', 'two'),
- ('qux', 'one'),
- ('qux', 'two')]
- In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
- In [5]: index
- Out[5]:
- MultiIndex([('bar', 'one'),
- ('bar', 'two'),
- ('baz', 'one'),
- ('baz', 'two'),
- ('foo', 'one'),
- ('foo', 'two'),
- ('qux', 'one'),
- ('qux', 'two')],
- names=['first', 'second'])
- In [6]: s = pd.Series(np.random.randn(8), index=index)
- In [7]: s
- Out[7]:
- first second
- bar one 0.469112
- two -0.282863
- baz one -1.509059
- two -1.135632
- foo one 1.212112
- two -0.173215
- qux one 0.119209
- two -1.044236
- dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:
- In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
- In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
- Out[9]:
- MultiIndex([('bar', 'one'),
- ('bar', 'two'),
- ('baz', 'one'),
- ('baz', 'two'),
- ('foo', 'one'),
- ('foo', 'two'),
- ('qux', 'one'),
- ('qux', 'two')],
- names=['first', 'second'])
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().
New in version 0.24.0.
- In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
- ....: ['foo', 'one'], ['foo', 'two']],
- ....: columns=['first', 'second'])
- ....:
- In [11]: pd.MultiIndex.from_frame(df)
- Out[11]:
- MultiIndex([('bar', 'one'),
- ('bar', 'two'),
- ('foo', 'one'),
- ('foo', 'two')],
- names=['first', 'second'])
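Conversely, MultiIndex.to_frame() materializes the levels back into a DataFrame; a minimal sketch of the round trip using the df above:
- mi = pd.MultiIndex.from_frame(df)
- mi.to_frame(index=False)      # a DataFrame with columns 'first' and 'second' again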
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
- In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
- ....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
- ....:
- In [13]: s = pd.Series(np.random.randn(8), index=arrays)
- In [14]: s
- Out[14]:
- bar one -0.861849
- two -2.104569
- baz one -0.494929
- two 1.071804
- foo one 0.721555
- two -0.706771
- qux one -1.039575
- two 0.271860
- dtype: float64
- In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
- In [16]: df
- Out[16]:
- 0 1 2 3
- bar one -0.424972 0.567020 0.276232 -1.087401
- two -0.673690 0.113648 -1.478427 0.524988
- baz one 0.404705 0.577046 -1.715002 -1.039268
- two -0.370647 -1.157892 -1.344312 0.844885
- foo one 1.075770 -0.109050 1.643563 -1.469388
- two 0.357021 -0.674600 -1.776904 -0.968914
- qux one -1.294524 0.413738 0.276662 -0.472035
- two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:
- In [17]: df.index.names
- Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
- In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
- In [19]: df
- Out[19]:
- first bar baz foo qux
- second one two one two one two one two
- A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
- B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737
- C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747
- In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
- Out[20]:
- first bar baz foo
- second one two one two one two
- first second
- bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804
- two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734
- baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738
- two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849
- foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232
- two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441
We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():
- In [21]: with pd.option_context('display.multi_sparse', False):
- ....: df
- ....:
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
- In [22]: pd.Series(np.random.randn(8), index=tuples)
- Out[22]:
- (bar, one) -1.236269
- (bar, two) 0.896171
- (baz, one) -0.487602
- (baz, two) -0.082240
- (foo, one) -2.182937
- (foo, two) 0.380396
- (qux, one) 0.084844
- (qux, two) 0.432390
- dtype: float64
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
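A common way to do that (a sketch; the 'value' column here is made up for illustration) is to read the data with flat columns and then promote some of them to index levels with set_index():
- flat = pd.DataFrame({'first': ['bar', 'bar', 'baz', 'baz'],
-                      'second': ['one', 'two', 'one', 'two'],
-                      'value': np.arange(4.0)})
- flat = flat.set_index(['first', 'second'])   # the two columns become MultiIndex levels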
Reconstructing the level labels
The method get_level_values() will return a vector of the labels for each location at a particular level:
- In [23]: index.get_level_values(0)
- Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
- In [24]: index.get_level_values('second')
- Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
Basic indexing on axis with MultiIndex
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:
- In [25]: df['bar']
- Out[25]:
- second one two
- A 0.895717 0.805244
- B 0.410835 0.813850
- C -1.413681 1.607920
- In [26]: df['bar', 'one']
- Out[26]:
- A 0.895717
- B 0.410835
- C -1.413681
- Name: (bar, one), dtype: float64
- In [27]: df['bar']['one']
- Out[27]:
- A 0.895717
- B 0.410835
- C -1.413681
- Name: one, dtype: float64
- In [28]: s['qux']
- Out[28]:
- one -1.039575
- two 0.271860
- dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:
- In [29]: df.columns.levels # original MultiIndex
- Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
- In [30]: df[['foo','qux']].columns.levels # sliced
- Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
- In [31]: df[['foo', 'qux']].columns.to_numpy()
- Out[31]:
- array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
- dtype=object)
- # for a specific level
- In [32]: df[['foo', 'qux']].columns.get_level_values(0)
- Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.
New in version 0.20.0.
- In [33]: new_mi = df[['foo', 'qux']].columns.remove_unused_levels()
- In [34]: new_mi.levels
- Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Data alignment and using reindex
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:
- In [35]: s + s[:-2]
- Out[35]:
- bar one -1.723698
- two -4.209138
- baz one -0.989859
- two 2.143608
- foo one 1.443110
- two -1.413542
- qux one NaN
- two NaN
- dtype: float64
- In [36]: s + s[::2]
- Out[36]:
- bar one -1.723698
- two NaN
- baz one -0.989859
- two NaN
- foo one 1.443110
- two NaN
- qux one -2.079150
- two NaN
- dtype: float64
The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:
- In [37]: s.reindex(index[:3])
- Out[37]:
- first second
- bar one -0.861849
- two -2.104569
- baz one -0.494929
- dtype: float64
- In [38]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')])
- Out[38]:
- foo two -0.706771
- bar one -0.861849
- qux one -1.039575
- baz one -0.494929
- dtype: float64
Advanced indexing with hierarchical index
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:
- In [39]: df = df.T
- In [40]: df
- Out[40]:
- A B C
- first second
- bar one 0.895717 0.410835 -1.413681
- two 0.805244 0.813850 1.607920
- baz one -1.206412 0.132003 1.024180
- two 2.565646 -0.827317 0.569605
- foo one 1.431256 -0.076467 0.875906
- two 1.340309 -1.187678 -2.211372
- qux one -1.170299 1.130127 0.974466
- two -0.226169 -1.436737 -2.006747
- In [41]: df.loc[('bar', 'two')]
- Out[41]:
- A 0.805244
- B 0.813850
- C 1.607920
- Name: (bar, two), dtype: float64
Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.
If you also want to index a specific column with .loc, you must use a tuple like this:
- In [42]: df.loc[('bar', 'two'), 'A']
- Out[42]: 0.8052440253863785
You don’t have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:
- df.loc['bar']
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).
“Partial” slicing also works quite nicely.
- In [43]: df.loc['baz':'foo']
- Out[43]:
- A B C
- first second
- baz one -1.206412 0.132003 1.024180
- two 2.565646 -0.827317 0.569605
- foo one 1.431256 -0.076467 0.875906
- two 1.340309 -1.187678 -2.211372
You can slice with a ‘range’ of values, by providing a slice of tuples.
- In [44]: df.loc[('baz', 'two'):('qux', 'one')]
- Out[44]:
- A B C
- first second
- baz two 2.565646 -0.827317 0.569605
- foo one 1.431256 -0.076467 0.875906
- two 1.340309 -1.187678 -2.211372
- qux one -1.170299 1.130127 0.974466
- In [45]: df.loc[('baz', 'two'):'foo']
- Out[45]:
- A B C
- first second
- baz two 2.565646 -0.827317 0.569605
- foo one 1.431256 -0.076467 0.875906
- two 1.340309 -1.187678 -2.211372
Passing a list of labels or tuples works similar to reindexing:
- In [46]: df.loc[[('bar', 'two'), ('qux', 'one')]]
- Out[46]:
- A B C
- first second
- bar two 0.805244 0.813850 1.607920
- qux one -1.170299 1.130127 0.974466
Note
It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:
- In [47]: s = pd.Series([1, 2, 3, 4, 5, 6],
- ....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]))
- ....:
- In [48]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples
- Out[48]:
- A c 1
- B d 5
- dtype: int64
- In [49]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists
- Out[49]:
- A c 1
- d 2
- B c 4
- d 5
- dtype: int64
Using slicers
You can slice a MultiIndex by providing multiple indexers.
You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.
You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).
As usual, both sides of the slicers are included as this is label indexing.
Warning
You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into, say, the MultiIndex for the rows.
You should do this:
- df.loc[(slice('A1', 'A3'), ...), :] # noqa: E999
You should not do this:
- df.loc[(slice('A1', 'A3'), ...)] # noqa: E999
- In [50]: def mklbl(prefix, n):
- ....: return ["%s%s" % (prefix, i) for i in range(n)]
- ....:
- In [51]: miindex = pd.MultiIndex.from_product([mklbl('A', 4),
- ....: mklbl('B', 2),
- ....: mklbl('C', 4),
- ....: mklbl('D', 2)])
- ....:
- In [52]: micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
- ....: ('b', 'foo'), ('b', 'bah')],
- ....: names=['lvl0', 'lvl1'])
- ....:
- In [53]: dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
- ....: .reshape((len(miindex), len(micolumns))),
- ....: index=miindex,
- ....: columns=micolumns).sort_index().sort_index(axis=1)
- ....:
- In [54]: dfmi
- Out[54]:
- lvl0 a b
- lvl1 bar foo bah foo
- A0 B0 C0 D0 1 0 3 2
- D1 5 4 7 6
- C1 D0 9 8 11 10
- D1 13 12 15 14
- C2 D0 17 16 19 18
- ... ... ... ... ...
- A3 B1 C1 D1 237 236 239 238
- C2 D0 241 240 243 242
- D1 245 244 247 246
- C3 D0 249 248 251 250
- D1 253 252 255 254
- [64 rows x 4 columns]
Basic MultiIndex slicing using slices, lists, and labels.
- In [55]: dfmi.loc[(slice('A1', 'A3'), slice(None), ['C1', 'C3']), :]
- Out[55]:
- lvl0 a b
- lvl1 bar foo bah foo
- A1 B0 C1 D0 73 72 75 74
- D1 77 76 79 78
- C3 D0 89 88 91 90
- D1 93 92 95 94
- B1 C1 D0 105 104 107 106
- ... ... ... ... ...
- A3 B0 C3 D1 221 220 223 222
- B1 C1 D0 233 232 235 234
- D1 237 236 239 238
- C3 D0 249 248 251 250
- D1 253 252 255 254
- [24 rows x 4 columns]
You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).
- In [56]: idx = pd.IndexSlice
- In [57]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
- Out[57]:
- lvl0 a b
- lvl1 foo foo
- A0 B0 C1 D0 8 10
- D1 12 14
- C3 D0 24 26
- D1 28 30
- B1 C1 D0 40 42
- ... ... ...
- A3 B0 C3 D1 220 222
- B1 C1 D0 232 234
- D1 236 238
- C3 D0 248 250
- D1 252 254
- [32 rows x 2 columns]
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
- In [58]: dfmi.loc['A1', (slice(None), 'foo')]
- Out[58]:
- lvl0 a b
- lvl1 foo foo
- B0 C0 D0 64 66
- D1 68 70
- C1 D0 72 74
- D1 76 78
- C2 D0 80 82
- ... ... ...
- B1 C1 D1 108 110
- C2 D0 112 114
- D1 116 118
- C3 D0 120 122
- D1 124 126
- [16 rows x 2 columns]
- In [59]: dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]
- Out[59]:
- lvl0 a b
- lvl1 foo foo
- A0 B0 C1 D0 8 10
- D1 12 14
- C3 D0 24 26
- D1 28 30
- B1 C1 D0 40 42
- ... ... ...
- A3 B0 C3 D1 220 222
- B1 C1 D0 232 234
- D1 236 238
- C3 D0 248 250
- D1 252 254
- [32 rows x 2 columns]
Using a boolean indexer you can provide selection related to the values.
- In [60]: mask = dfmi[('a', 'foo')] > 200
- In [61]: dfmi.loc[idx[mask, :, ['C1', 'C3']], idx[:, 'foo']]
- Out[61]:
- lvl0 a b
- lvl1 foo foo
- A3 B0 C1 D1 204 206
- C3 D0 216 218
- D1 220 222
- B1 C1 D0 232 234
- D1 236 238
- C3 D0 248 250
- D1 252 254
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
- In [62]: dfmi.loc(axis=0)[:, :, ['C1', 'C3']]
- Out[62]:
- lvl0 a b
- lvl1 bar foo bah foo
- A0 B0 C1 D0 9 8 11 10
- D1 13 12 15 14
- C3 D0 25 24 27 26
- D1 29 28 31 30
- B1 C1 D0 41 40 43 42
- ... ... ... ... ...
- A3 B0 C3 D1 221 220 223 222
- B1 C1 D0 233 232 235 234
- D1 237 236 239 238
- C3 D0 249 248 251 250
- D1 253 252 255 254
- [32 rows x 4 columns]
Furthermore, you can set the values using the following methods.
- In [63]: df2 = dfmi.copy()
- In [64]: df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10
- In [65]: df2
- Out[65]:
- lvl0 a b
- lvl1 bar foo bah foo
- A0 B0 C0 D0 1 0 3 2
- D1 5 4 7 6
- C1 D0 -10 -10 -10 -10
- D1 -10 -10 -10 -10
- C2 D0 17 16 19 18
- ... ... ... ... ...
- A3 B1 C1 D1 -10 -10 -10 -10
- C2 D0 241 240 243 242
- D1 245 244 247 246
- C3 D0 -10 -10 -10 -10
- D1 -10 -10 -10 -10
- [64 rows x 4 columns]
You can also use an alignable object as the right-hand side.
- In [66]: df2 = dfmi.copy()
- In [67]: df2.loc[idx[:, :, ['C1', 'C3']], :] = df2 * 1000
- In [68]: df2
- Out[68]:
- lvl0 a b
- lvl1 bar foo bah foo
- A0 B0 C0 D0 1 0 3 2
- D1 5 4 7 6
- C1 D0 9000 8000 11000 10000
- D1 13000 12000 15000 14000
- C2 D0 17 16 19 18
- ... ... ... ... ...
- A3 B1 C1 D1 237000 236000 239000 238000
- C2 D0 241 240 243 242
- D1 245 244 247 246
- C3 D0 249000 248000 251000 250000
- D1 253000 252000 255000 254000
- [64 rows x 4 columns]
Cross-section
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.
- In [69]: df
- Out[69]:
- A B C
- first second
- bar one 0.895717 0.410835 -1.413681
- two 0.805244 0.813850 1.607920
- baz one -1.206412 0.132003 1.024180
- two 2.565646 -0.827317 0.569605
- foo one 1.431256 -0.076467 0.875906
- two 1.340309 -1.187678 -2.211372
- qux one -1.170299 1.130127 0.974466
- two -0.226169 -1.436737 -2.006747
- In [70]: df.xs('one', level='second')
- Out[70]:
- A B C
- first
- bar 0.895717 0.410835 -1.413681
- baz -1.206412 0.132003 1.024180
- foo 1.431256 -0.076467 0.875906
- qux -1.170299 1.130127 0.974466
- # using the slicers
- In [71]: df.loc[(slice(None), 'one'), :]
- Out[71]:
- A B C
- first second
- bar one 0.895717 0.410835 -1.413681
- baz one -1.206412 0.132003 1.024180
- foo one 1.431256 -0.076467 0.875906
- qux one -1.170299 1.130127 0.974466
You can also select on the columns with xs, by providing the axis argument.
- In [72]: df = df.T
- In [73]: df.xs('one', level='second', axis=1)
- Out[73]:
- first bar baz foo qux
- A 0.895717 -1.206412 1.431256 -1.170299
- B 0.410835 0.132003 -0.076467 1.130127
- C -1.413681 1.024180 0.875906 0.974466
- # using the slicers
- In [74]: df.loc[:, (slice(None), 'one')]
- Out[74]:
- first bar baz foo qux
- second one one one one
- A 0.895717 -1.206412 1.431256 -1.170299
- B 0.410835 0.132003 -0.076467 1.130127
- C -1.413681 1.024180 0.875906 0.974466
xs also allows selection with multiple keys.
- In [75]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1)
- Out[75]:
- first bar
- second one
- A 0.895717
- B 0.410835
- C -1.413681
- # using the slicers
- In [76]: df.loc[:, ('bar', 'one')]
- Out[76]:
- A 0.895717
- B 0.410835
- C -1.413681
- Name: (bar, one), dtype: float64
You can pass drop_level=False to xs to retain the level that was selected.
- In [77]: df.xs('one', level='second', axis=1, drop_level=False)
- Out[77]:
- first bar baz foo qux
- second one one one one
- A 0.895717 -1.206412 1.431256 -1.170299
- B 0.410835 0.132003 -0.076467 1.130127
- C -1.413681 1.024180 0.875906 0.974466
Compare the above with the result using drop_level=True (the default value).
- In [78]: df.xs('one', level='second', axis=1, drop_level=True)
- Out[78]:
- first bar baz foo qux
- A 0.895717 -1.206412 1.431256 -1.170299
- B 0.410835 0.132003 -0.076467 1.130127
- C -1.413681 1.024180 0.875906 0.974466
Advanced reindexing and alignment
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:
- In [79]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x', 'y']],
- ....: codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
- ....:
- In [80]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)
- In [81]: df
- Out[81]:
- 0 1
- one y 1.519970 -0.493662
- x 0.600178 0.274230
- zero y 0.132885 -0.023688
- x 2.410179 1.450520
- In [82]: df2 = df.mean(level=0)
- In [83]: df2
- Out[83]:
- 0 1
- one 1.060074 -0.109716
- zero 1.271532 0.713416
- In [84]: df2.reindex(df.index, level=0)
- Out[84]:
- 0 1
- one y 1.060074 -0.109716
- x 1.060074 -0.109716
- zero y 1.271532 0.713416
- x 1.271532 0.713416
- # aligning
- In [85]: df_aligned, df2_aligned = df.align(df2, level=0)
- In [86]: df_aligned
- Out[86]:
- 0 1
- one y 1.519970 -0.493662
- x 0.600178 0.274230
- zero y 0.132885 -0.023688
- x 2.410179 1.450520
- In [87]: df2_aligned
- Out[87]:
- 0 1
- one y 1.060074 -0.109716
- x 1.060074 -0.109716
- zero y 1.271532 0.713416
- x 1.271532 0.713416
Swapping levels with swaplevel
The swaplevel() method can switch the order of two levels:
- In [88]: df[:5]
- Out[88]:
- 0 1
- one y 1.519970 -0.493662
- x 0.600178 0.274230
- zero y 0.132885 -0.023688
- x 2.410179 1.450520
- In [89]: df[:5].swaplevel(0, 1, axis=0)
- Out[89]:
- 0 1
- y one 1.519970 -0.493662
- x one 0.600178 0.274230
- y zero 0.132885 -0.023688
- x zero 2.410179 1.450520
Reordering levels with reorder_levels
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:
- In [90]: df[:5].reorder_levels([1, 0], axis=0)
- Out[90]:
- 0 1
- y one 1.519970 -0.493662
- x one 0.600178 0.274230
- y zero 0.132885 -0.023688
- x zero 2.410179 1.450520
Renaming names of an Index or MultiIndex
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.
- In [91]: df.rename(columns={0: "col0", 1: "col1"})
- Out[91]:
- col0 col1
- one y 1.519970 -0.493662
- x 0.600178 0.274230
- zero y 0.132885 -0.023688
- x 2.410179 1.450520
This method can also be used to rename specific labels of the main index of the DataFrame.
- In [92]: df.rename(index={"one": "two", "y": "z"})
- Out[92]:
- 0 1
- two z 1.519970 -0.493662
- x 0.600178 0.274230
- zero z 0.132885 -0.023688
- x 2.410179 1.450520
The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.
- In [93]: df.rename_axis(index=['abc', 'def'])
- Out[93]:
- 0 1
- abc def
- one y 1.519970 -0.493662
- x 0.600178 0.274230
- zero y 0.132885 -0.023688
- x 2.410179 1.450520
Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.
- In [94]: df.rename_axis(columns="Cols").columns
- Out[94]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
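For instance, a callable can be applied to the labels of a single level through the level argument of rename; a small sketch, assuming the df with the ('one'/'zero', 'x'/'y') index from above:
- # apply str.upper only to the labels of the second level
- df.rename(index=str.upper, level=1)      # 'x'/'y' become 'X'/'Y'; level 0 is unchanged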
Sorting a MultiIndex
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().
- In [95]: import random
- In [96]: random.shuffle(tuples)
- In [97]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))
- In [98]: s
- Out[98]:
- bar two 0.206053
- baz one -0.251905
- foo one -2.213588
- two 1.063327
- qux one 1.266143
- two 0.299368
- bar one -0.863838
- baz two 0.408204
- dtype: float64
- In [99]: s.sort_index()
- Out[99]:
- bar one -0.863838
- two 0.206053
- baz one -0.251905
- two 0.408204
- foo one -2.213588
- two 1.063327
- qux one 1.266143
- two 0.299368
- dtype: float64
- In [100]: s.sort_index(level=0)
- Out[100]:
- bar one -0.863838
- two 0.206053
- baz one -0.251905
- two 0.408204
- foo one -2.213588
- two 1.063327
- qux one 1.266143
- two 0.299368
- dtype: float64
- In [101]: s.sort_index(level=1)
- Out[101]:
- bar one -0.863838
- baz one -0.251905
- foo one -2.213588
- qux one 1.266143
- bar two 0.206053
- baz two 0.408204
- foo two 1.063327
- qux two 0.299368
- dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
- In [102]: s.index.set_names(['L1', 'L2'], inplace=True)
- In [103]: s.sort_index(level='L1')
- Out[103]:
- L1 L2
- bar one -0.863838
- two 0.206053
- baz one -0.251905
- two 0.408204
- foo one -2.213588
- two 1.063327
- qux one 1.266143
- two 0.299368
- dtype: float64
- In [104]: s.sort_index(level='L2')
- Out[104]:
- L1 L2
- bar one -0.863838
- baz one -0.251905
- foo one -2.213588
- qux one 1.266143
- bar two 0.206053
- baz two 0.408204
- foo two 1.063327
- qux two 0.299368
- dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
- In [105]: df.T.sort_index(level=1, axis=1)
- Out[105]:
- one zero one zero
- x x y y
- 0 0.600178 2.410179 1.519970 0.132885
- 1 0.274230 1.450520 -0.493662 -0.023688
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:
- In [106]: dfm = pd.DataFrame({'jim': [0, 0, 1, 1],
- .....: 'joe': ['x', 'x', 'z', 'y'],
- .....: 'jolie': np.random.rand(4)})
- .....:
- In [107]: dfm = dfm.set_index(['jim', 'joe'])
- In [108]: dfm
- Out[108]:
- jolie
- jim joe
- 0 x 0.490671
- x 0.120248
- 1 z 0.537020
- y 0.110968
- In [4]: dfm.loc[(1, 'z')]
- PerformanceWarning: indexing past lexsort depth may impact performance.
- Out[4]:
- jolie
- jim joe
- 1 z 0.64094
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
- In [5]: dfm.loc[(0, 'y'):(1, 'z')]
- UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
The is_lexsorted() method on a MultiIndex shows if the index is sorted, and the lexsort_depth property returns the sort depth:
- In [109]: dfm.index.is_lexsorted()
- Out[109]: False
- In [110]: dfm.index.lexsort_depth
- Out[110]: 1
- In [111]: dfm = dfm.sort_index()
- In [112]: dfm
- Out[112]:
- jolie
- jim joe
- 0 x 0.490671
- x 0.120248
- 1 y 0.110968
- z 0.537020
- In [113]: dfm.index.is_lexsorted()
- Out[113]: True
- In [114]: dfm.index.lexsort_depth
- Out[114]: 2
And now selection works as expected.
- In [115]: dfm.loc[(0, 'y'):(1, 'z')]
- Out[115]:
- jolie
- jim joe
- 1 y 0.110968
- z 0.537020
Take methods
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.
- In [116]: index = pd.Index(np.random.randint(0, 1000, 10))
- In [117]: index
- Out[117]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
- In [118]: positions = [0, 9, 3]
- In [119]: index[positions]
- Out[119]: Int64Index([214, 329, 567], dtype='int64')
- In [120]: index.take(positions)
- Out[120]: Int64Index([214, 329, 567], dtype='int64')
- In [121]: ser = pd.Series(np.random.randn(10))
- In [122]: ser.iloc[positions]
- Out[122]:
- 0 -0.179666
- 9 1.824375
- 3 0.392149
- dtype: float64
- In [123]: ser.take(positions)
- Out[123]:
- 0 -0.179666
- 9 1.824375
- 3 0.392149
- dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
- In [124]: frm = pd.DataFrame(np.random.randn(5, 3))
- In [125]: frm.take([1, 4, 3])
- Out[125]:
- 0 1 2
- 1 -1.237881 0.106854 -1.276829
- 4 0.629675 -1.425966 1.857704
- 3 0.979542 -1.633678 0.615855
- In [126]: frm.take([0, 2], axis=1)
- Out[126]:
- 0 2
- 0 0.595974 0.601544
- 1 -1.237881 -1.276829
- 2 -0.767101 1.499591
- 3 0.979542 0.615855
- 4 0.629675 1.857704
It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.
- In [127]: arr = np.random.randn(10)
- In [128]: arr.take([False, False, True, True])
- Out[128]: array([-1.1935, -1.1935, 0.6775, 0.6775])
- In [129]: arr[[0, 1]]
- Out[129]: array([-1.1935, 0.6775])
- In [130]: ser = pd.Series(np.random.randn(10))
- In [131]: ser.take([False, False, True, True])
- Out[131]:
- 0 0.233141
- 0 0.233141
- 1 -0.223540
- 1 -0.223540
- dtype: float64
- In [132]: ser.iloc[[0, 1]]
- Out[132]:
- 0 0.233141
- 1 -0.223540
- dtype: float64
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.
- In [133]: arr = np.random.randn(10000, 5)
- In [134]: indexer = np.arange(10000)
- In [135]: random.shuffle(indexer)
- In [136]: %timeit arr[indexer]
- .....: %timeit arr.take(indexer, axis=0)
- .....:
- 155 us +- 7.75 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
- 41.5 us +- 530 ns per loop (mean +- std. dev. of 7 runs, 10000 loops each)
- In [137]: ser = pd.Series(arr[:, 0])
- In [138]: %timeit ser.iloc[indexer]
- .....: %timeit ser.take(indexer)
- .....:
- 121 us +- 4.48 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
- 110 us +- 3 us per loop (mean +- std. dev. of 7 runs, 10000 loops each)
Index types
We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.
- In [139]: from pandas.api.types import CategoricalDtype
- In [140]: df = pd.DataFrame({'A': np.arange(6),
- .....: 'B': list('aabbca')})
- .....:
- In [141]: df['B'] = df['B'].astype(CategoricalDtype(list('cab')))
- In [142]: df
- Out[142]:
- A B
- 0 0 a
- 1 1 a
- 2 2 b
- 3 3 b
- 4 4 c
- 5 5 a
- In [143]: df.dtypes
- Out[143]:
- A int64
- B category
- dtype: object
- In [144]: df.B.cat.categories
- Out[144]: Index(['c', 'a', 'b'], dtype='object')
Setting the index will create a CategoricalIndex.
- In [145]: df2 = df.set_index('B')
- In [146]: df2.index
- Out[146]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.
- In [147]: df2.loc['a']
- Out[147]:
- A
- B
- a 0
- a 1
- a 5
The CategoricalIndex is preserved after indexing:
- In [148]: df2.loc['a'].index
- Out[148]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).
- In [149]: df2.sort_index()
- Out[149]:
- A
- B
- c 4
- a 0
- a 1
- a 5
- b 2
- b 3
Groupby operations on the index will preserve the index nature as well.
- In [150]: df2.groupby(level=0).sum()
- Out[150]:
- A
- B
- c 4
- a 6
- b 5
- In [151]: df2.groupby(level=0).sum().index
- Out[151]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.
- In [152]: df2.reindex(['a', 'e'])
- Out[152]:
- A
- B
- a 0.0
- a 1.0
- a 5.0
- e NaN
- In [153]: df2.reindex(['a', 'e']).index
- Out[153]: Index(['a', 'a', 'a', 'e'], dtype='object', name='B')
- In [154]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde')))
- Out[154]:
- A
- B
- a 0.0
- a 1.0
- a 5.0
- e NaN
- In [155]: df2.reindex(pd.Categorical(['a', 'e'], categories=list('abcde'))).index
- Out[155]: CategoricalIndex(['a', 'a', 'a', 'e'], categories=['a', 'b', 'c', 'd', 'e'], ordered=False, name='B', dtype='category')
Warning
Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.
- In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})
- In [11]: df3 = df3.set_index('B')
- In [11]: df3.index
- Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')
- In [12]: pd.concat([df2, df3])
- TypeError: categories must match existing categories when appending
Int64Index and RangeIndex
Warning
Indexing on an integer-based Index with floats has been clarified in 0.18.0; for a summary of the changes, see here.
Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.
RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects. RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analogous to Python range types.
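A small sketch of the distinction (the values in the comments are illustrative for pandas of this era):
- pd.Series([1, 2, 3]).index       # RangeIndex(start=0, stop=3, step=1), the default index
- pd.Index([4, 5, 6])              # an explicit Int64Index([4, 5, 6], dtype='int64')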
Float64Index
By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [],ix,loc for scalar indexing and slicing work exactly the same.
- In [156]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])
- In [157]: indexf
- Out[157]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
- In [158]: sf = pd.Series(range(5), index=indexf)
- In [159]: sf
- Out[159]:
- 1.5 0
- 2.0 1
- 3.0 2
- 4.5 3
- 5.0 4
- dtype: int64
Scalar selection for [],.loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0).
- In [160]: sf[3]
- Out[160]: 2
- In [161]: sf[3.0]
- Out[161]: 2
- In [162]: sf.loc[3]
- Out[162]: 2
- In [163]: sf.loc[3.0]
- Out[163]: 2
The only positional indexing is via iloc.
- In [164]: sf.iloc[3]
- Out[164]: 3
A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using [],ix,loc, and always positional when using iloc. The exception is when the slice is boolean, in which case it will always be positional.
- In [165]: sf[2:4]
- Out[165]:
- 2.0 1
- 3.0 2
- dtype: int64
- In [166]: sf.loc[2:4]
- Out[166]:
- 2.0 1
- 3.0 2
- dtype: int64
- In [167]: sf.iloc[2:4]
- Out[167]:
- 3.0 2
- 4.5 3
- dtype: int64
In float indexes, slicing using floats is allowed.
- In [168]: sf[2.1:4.6]
- Out[168]:
- 3.0 2
- 4.5 3
- dtype: int64
- In [169]: sf.loc[2.1:4.6]
- Out[169]:
- 3.0 2
- 4.5 3
- dtype: int64
In non-float indexes, slicing using floats will raise a TypeError.
- In [1]: pd.Series(range(5))[3.5]
- TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
- In [1]: pd.Series(range(5))[3.5:4.5]
- TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Warning
Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError:
- In [3]: pd.Series(range(5)).iloc[3.0]
- TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.
- In [170]: dfir = pd.concat([pd.DataFrame(np.random.randn(5, 2),
- .....: index=np.arange(5) * 250.0,
- .....: columns=list('AB')),
- .....: pd.DataFrame(np.random.randn(6, 2),
- .....: index=np.arange(4, 10) * 250.1,
- .....: columns=list('AB'))])
- .....:
- In [171]: dfir
- Out[171]:
- A B
- 0.0 -0.435772 -1.188928
- 250.0 -0.808286 -0.284634
- 500.0 -1.815703 1.347213
- 750.0 -0.243487 0.514704
- 1000.0 1.162969 -0.287725
- 1000.4 -0.179734 0.993962
- 1250.5 -0.212673 0.909872
- 1500.6 -0.733333 -0.349893
- 1750.7 0.456434 -0.306735
- 2000.8 0.553396 0.166221
- 2250.9 -0.101684 -0.734907
Selection operations then will always work on a value basis, for all selection operators.
- In [172]: dfir[0:1000.4]
- Out[172]:
- A B
- 0.0 -0.435772 -1.188928
- 250.0 -0.808286 -0.284634
- 500.0 -1.815703 1.347213
- 750.0 -0.243487 0.514704
- 1000.0 1.162969 -0.287725
- 1000.4 -0.179734 0.993962
- In [173]: dfir.loc[0:1001, 'A']
- Out[173]:
- 0.0 -0.435772
- 250.0 -0.808286
- 500.0 -1.815703
- 750.0 -0.243487
- 1000.0 1.162969
- 1000.4 -0.179734
- Name: A, dtype: float64
- In [174]: dfir.loc[1000.4]
- Out[174]:
- A -0.179734
- B 0.993962
- Name: 1000.4, dtype: float64
You could retrieve the first 1 second (1000 ms) of data as such:
- In [175]: dfir[0:1000]
- Out[175]:
- A B
- 0.0 -0.435772 -1.188928
- 250.0 -0.808286 -0.284634
- 500.0 -1.815703 1.347213
- 750.0 -0.243487 0.514704
- 1000.0 1.162969 -0.287725
If you need integer based selection, you should use iloc:
- In [176]: dfir.iloc[0:5]
- Out[176]:
- A B
- 0.0 -0.435772 -1.188928
- 250.0 -0.808286 -0.284634
- 500.0 -1.815703 1.347213
- 750.0 -0.243487 0.514704
- 1000.0 1.162969 -0.287725
IntervalIndex
New in version 0.20.0.
IntervalIndex together with its own dtype, IntervalDtype, as well as the Interval scalar type, allow first-class support in pandas for interval notation.
The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().
Indexing with an IntervalIndex
An IntervalIndex can be used in Series and in DataFrame as the index.
- In [177]: df = pd.DataFrame({'A': [1, 2, 3, 4]},
- .....: index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]))
- .....:
- In [178]: df
- Out[178]:
- A
- (0, 1] 1
- (1, 2] 2
- (2, 3] 3
- (3, 4] 4
Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.
- In [179]: df.loc[2]
- Out[179]:
- A 2
- Name: (1, 2], dtype: int64
- In [180]: df.loc[[2, 3]]
- Out[180]:
- A
- (1, 2] 2
- (2, 3] 3
If you select a label contained within an interval, this will also select the interval.
- In [181]: df.loc[2.5]
- Out[181]:
- A 3
- Name: (2, 3], dtype: int64
- In [182]: df.loc[[2.5, 3.5]]
- Out[182]:
- A
- (2, 3] 3
- (3, 4] 4
Selecting using an Interval will only return exact matches (starting from pandas 0.25.0).
- In [183]: df.loc[pd.Interval(1, 2)]
- Out[183]:
- A 2
- Name: (1, 2], dtype: int64
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
- In [7]: df.loc[pd.Interval(0.5, 2.5)]
- KeyError: Interval(0.5, 2.5, closed='right')
Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.
- In [184]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
- In [185]: idxr
- Out[185]: array([ True, True, True, False])
- In [186]: df[idxr]
- Out[186]:
- A
- (0, 1] 1
- (1, 2] 2
- (2, 3] 3
Binning data with cut and qcut
cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.
- In [187]: c = pd.cut(range(4), bins=2)
- In [188]: c
- Out[188]:
- [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
- Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
- In [189]: c.categories
- Out[189]:
- IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
- closed='right',
- dtype='interval[float64]')
cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.
- In [190]: pd.cut([0, 3, 5, 1], bins=c.categories)
- Out[190]:
- [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
- Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
Any value which falls outside all bins will be assigned a NaN value.
Generating ranges of intervals
If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:
- In [191]: pd.interval_range(start=0, end=5)
- Out[191]:
- IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
- closed='right',
- dtype='interval[int64]')
- In [192]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
- Out[192]:
- IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03], (2017-01-03, 2017-01-04], (2017-01-04, 2017-01-05]],
- closed='right',
- dtype='interval[datetime64[ns]]')
- In [193]: pd.interval_range(end=pd.Timedelta('3 days'), periods=3)
- Out[193]:
- IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]],
- closed='right',
- dtype='interval[timedelta64[ns]]')
The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:
- In [194]: pd.interval_range(start=0, periods=5, freq=1.5)
- Out[194]:
- IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]],
- closed='right',
- dtype='interval[float64]')
- In [195]: pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
- Out[195]:
- IntervalIndex([(2017-01-01, 2017-01-08], (2017-01-08, 2017-01-15], (2017-01-15, 2017-01-22], (2017-01-22, 2017-01-29]],
- closed='right',
- dtype='interval[datetime64[ns]]')
- In [196]: pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')
- Out[196]:
- IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]],
- closed='right',
- dtype='interval[timedelta64[ns]]')
Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.
- In [197]: pd.interval_range(start=0, end=4, closed='both')
- Out[197]:
- IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]],
- closed='both',
- dtype='interval[int64]')
- In [198]: pd.interval_range(start=0, end=4, closed='neither')
- Out[198]:
- IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)],
- closed='neither',
- dtype='interval[int64]')
New in version 0.23.0.
Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:
- In [199]: pd.interval_range(start=0, end=6, periods=4)
- Out[199]:
- IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
- closed='right',
- dtype='interval[float64]')
- In [200]: pd.interval_range(pd.Timestamp('2018-01-01'),
- .....: pd.Timestamp('2018-02-28'), periods=3)
- .....:
- Out[200]:
- IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
- closed='right',
- dtype='interval[datetime64[ns]]')
Miscellaneous indexing FAQ
Integer indexing
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:
- In [201]: s = pd.Series(range(5))
- In [202]: s[-1]
- ---------------------------------------------------------------------------
- KeyError Traceback (most recent call last)
- <ipython-input-202-76c3dce40054> in <module>
- ----> 1 s[-1]
- /pandas/pandas/core/series.py in __getitem__(self, key)
- 1069 key = com.apply_if_callable(key, self)
- 1070 try:
- -> 1071 result = self.index.get_value(self, key)
- 1072
- 1073 if not is_scalar(result):
- /pandas/pandas/core/indexes/base.py in get_value(self, series, key)
- 4728 k = self._convert_scalar_indexer(k, kind="getitem")
- 4729 try:
- -> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
- 4731 except KeyError as e1:
- 4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
- /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
- /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
- /pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
- /pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
- /pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
- KeyError: -1
- In [203]: df = pd.DataFrame(np.random.randn(5, 4))
- In [204]: df
- Out[204]:
- 0 1 2 3
- 0 -0.130121 -0.476046 0.759104 0.213379
- 1 -0.082641 0.448008 0.656420 -1.051443
- 2 0.594956 -0.151360 -0.069303 1.221431
- 3 -0.182832 0.791235 0.042745 2.069775
- 4 1.446552 0.019814 -1.389212 -0.702312
- In [205]: df.loc[-2:]
- Out[205]:
- 0 1 2 3
- 0 -0.130121 -0.476046 0.759104 0.213379
- 1 -0.082641 0.448008 0.656420 -1.051443
- 2 0.594956 -0.151360 -0.069303 1.221431
- 3 -0.182832 0.791235 0.042745 2.069775
- 4 1.446552 0.019814 -1.389212 -0.702312
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
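If what you want is positional access, .iloc is the explicit way to ask for it; a minimal sketch:
- s = pd.Series(range(5))
- s.iloc[-1]        # last element by position: 4
- s.iloc[-2:]       # last two elements by position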
Non-monotonic indexes require exact matches
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes.
- In [206]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=['data'], data=list(range(5)))
- In [207]: df.index.is_monotonic_increasing
- Out[207]: True
- # no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
- In [208]: df.loc[0:4, :]
- Out[208]:
- data
- 2 0
- 3 1
- 3 2
- 4 3
- # slice is outside the index, so an empty DataFrame is returned
- In [209]: df.loc[13:15, :]
- Out[209]:
- Empty DataFrame
- Columns: [data]
- Index: []
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
- In [210]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5],
- .....: columns=['data'], data=list(range(6)))
- .....:
- In [211]: df.index.is_monotonic_increasing
- Out[211]: False
- # OK because 2 and 4 are in the index
- In [212]: df.loc[2:4, :]
- Out[212]:
- data
- 2 0
- 3 1
- 1 2
- 4 3
- # 0 is not in the index
- In [9]: df.loc[0:4, :]
- KeyError: 0
- # 3 is not a unique label
- In [11]: df.loc[2:3, :]
- KeyError: 'Cannot get right slice bound for non-unique label: 3'
Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique() attribute.
- In [213]: weakly_monotonic = pd.Index(['a', 'b', 'c', 'c'])
- In [214]: weakly_monotonic
- Out[214]: Index(['a', 'b', 'c', 'c'], dtype='object')
- In [215]: weakly_monotonic.is_monotonic_increasing
- Out[215]: True
- In [216]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
- Out[216]: False
Endpoints are inclusive
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:
- In [217]: s = pd.Series(np.random.randn(6), index=list('abcdef'))
- In [218]: s
- Out[218]:
- a 0.301379
- b 1.240445
- c -0.846068
- d -0.043312
- e -1.658747
- f -0.819549
- dtype: float64
Suppose we wished to slice from c to e, using integers this would be accomplished as such:
- In [219]: s[2:5]
- Out[219]:
- c -0.846068
- d -0.043312
- e -1.658747
- dtype: float64
However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:
- s.loc['c':'e' + 1]
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:
- In [220]: s.loc['c':'e']
- Out[220]:
- c -0.846068
- d -0.043312
- e -1.658747
- dtype: float64
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
Indexing potentially changes underlying Series dtype
The different indexing operations can potentially change the dtype of a Series.
- In [221]: series1 = pd.Series([1, 2, 3])
- In [222]: series1.dtype
- Out[222]: dtype('int64')
- In [223]: res = series1.reindex([0, 4])
- In [224]: res.dtype
- Out[224]: dtype('float64')
- In [225]: res
- Out[225]:
- 0 1.0
- 4 NaN
- dtype: float64
- In [226]: series2 = pd.Series([True])
- In [227]: series2.dtype
- Out[227]: dtype('bool')
- In [228]: res = series2.reindex_like(series1)
- In [229]: res.dtype
- Out[229]: dtype('O')
- In [230]: res
- Out[230]:
- 0 True
- 1 NaN
- 2 NaN
- dtype: object
This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
See this old issue for a more detailed discussion.
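One possible workaround (a sketch, not from the original text) is to restore the intended dtype after reindexing, before handing the data to a ufunc:
- res = series2.reindex_like(series1)
- res = res.fillna(False).astype(bool)     # NaNs filled, dtype is bool again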