MultiIndex / Advanced Indexing
This section covers indexing with a MultiIndex and other advanced indexing features.
See the Indexing and Selecting Data for general indexing documentation.
Warning
Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
See the cookbook for some advanced strategies.
Hierarchical indexing (MultiIndex)
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and MultiIndex.set_labels to MultiIndex.set_codes.
Creating a MultiIndex (hierarchical index) object
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
...:
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:
In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second'])
Out[9]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
codes=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=['first', 'second'])
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().
New in version 0.24.0.
In [10]: df = pd.DataFrame([['bar', 'one'], ['bar', 'two'],
....: ['foo', 'one'], ['foo', 'two']],
....: columns=['first', 'second'])
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex(levels=[['bar', 'foo'], ['one', 'two']],
codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['first', 'second'])
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:
In [12]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
....:
In [13]: s = pd.Series(np.random.randn(8), index=arrays)
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:
In [17]: df.index.names Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index) In [19]: df Out[19]: first bar baz foo qux second one two one two one two one two A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169 B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737 C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747 In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
https://pandas.pydata.org/pandas-docs/version/0.24.2/user_guide/advanced.html