0

I have a MultiIndex column Pandas DataFrame A:

       foo    
       bar   baz
s1_a     1     2
s1_b     3     4
s2_a     5     6
s2_b     7     8

I want to group the data on a key from another DataFrame B:

       key
s1_a     1
s1_b     1
s2_a     2
s2_b     2

For a DataFrame without a MultiIndex, I would do:

pd.merge(A, B, left_index=True, right_index=True).groupby('key').sum()

but this does not work with the MultiIndex. The desired result is

    foo
    bar    baz
1     4      6
2    12     14

How can I achieve this?

2 Answers 2

2

You can subset the level and restore is after the merge using pandas.concat:

C = pd.concat({'foo': (pd.merge(A['foo'], B, left_index=True, right_index=True)
                         .groupby('key')
                         .sum()
                       )
              }, axis=1)

output:

>>> C
    foo    
    bar baz
key        
1     4   6
2    12  14

NB. technically, the code you used should work with a FutureWarning, however you lose the MultiIndex an get tuples instead

>>> pd.merge(A, B, left_index=True, right_index=True).groupby('key').sum()
     (foo, bar)  (foo, baz)
key                        
1             4           6
2            12          14

FutureWarning: merging between different levels is deprecated and will be removed in a future version. (2 levels on the left,1 on the right)
Sign up to request clarification or add additional context in comments.

Comments

1

Series can be passed to groupby directly and the grouper will do appropriate index alignment, so it is possible to do:

A.groupby(B['key']).sum()
    foo    
    bar baz
key        
1     4   6
2    12  14

Setup:

import numpy as np
import pandas as pd

idx = ['s1_a', 's1_b', 's2_a', 's2_b']
A = pd.DataFrame(
    np.arange(1, 9).reshape((-1, 2)),
    index=idx,
    columns=pd.MultiIndex.from_product([['foo'], ['bar', 'baz']])
)

B = pd.DataFrame({'key': [1, 1, 2, 2]}, index=idx)

Note this will work in many cases but is less durable than merging:

B variant 1:

B = pd.DataFrame({'key': [1, 2, 2]}, index=['s1_a', 's1_b', 's2_b'])
      key
s1_a    1  # No s2_a
s1_b    2
s2_b    2


A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1.0   1   2
2.0  10  12

B variant 2:

B = pd.DataFrame({'key': [1, 1, 2, 2]}, index=['s1_a', 's2_a', 's1_b', 's2_b'])
      key
s1_a    1
s2_a    1  # s1_a/s2_a together
s1_b    2
s2_b    2  # s1_b/s2_b together

A.groupby(B['key']).sum()

    foo    
    bar baz
key        
1     6   8
2    10  12

1 Comment

Great! I didn't know you could pass a Series to groupby(). This works perfectly.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.