
After performing some operations on a dataset I ended up with a data frame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5], 
              index = pd.MultiIndex.from_arrays(np.array([[0, 0, 1, 1, 1],
                                                          [1448, 1448, 1448, 1448, 1448],
                                                          [0, 0, 1, 1, 1],
                                                          [1448, 1448, 1448, 1448, 1448],
                                                          ['22', '11', '14', '15', '19']]), 
                                                names = ['meas_id', 'hec_id', 'meas_id', 'hec_id', 'lab_id']))

print(df)

                                      0
meas_id hec_id meas_id hec_id lab_id   
0       1448   0       1448   22      1
                              11      2
1       1448   1       1448   14      3
                              15      4
                              19      5

As you can notice, the meas_id and hec_id columns of the MultiIndex are duplicated. I could use reset_index([3, 4], drop=True) to drop them, but the resulting data frame is actually part of a list of data frames, where some of them have those index columns duplicated and some do not. Therefore my question is: how can I drop duplicated index columns from a MultiIndex?

Please note that this question refers to duplicates in the index columns (levels), not duplicate rows.


2 Answers


For a solution that generalises to dropping duplicated index levels by name, you can do:

import numpy as np

# positions of the index levels whose name repeats an earlier level's name
ixs = np.flatnonzero(pd.Index(df.index.names).duplicated()).tolist()
df.reset_index(ixs, drop=True)

                       0
meas_id hec_id lab_id   
0       1448   22      1
               11      2
1       1448   14      3
               15      4
               19      5
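
Since the frames in your list differ — some have the duplicated levels and some do not — the same idea can be wrapped in a small helper and applied to every frame. A minimal sketch (the helper name drop_duplicate_levels and the list name dfs are just placeholders for illustration):

import numpy as np
import pandas as pd

def drop_duplicate_levels(frame):
    # positions of index levels whose name repeats an earlier level's name
    dup_positions = np.flatnonzero(pd.Index(frame.index.names).duplicated()).tolist()
    if not dup_positions:
        return frame  # no duplicated levels, leave the frame untouched
    return frame.reset_index(dup_positions, drop=True)

dfs = [drop_duplicate_levels(frame) for frame in dfs]

Frames without duplicated level names pass through unchanged, so the whole list can be processed uniformly.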

The accepted answer solves the problem, but in my experience duplicated index levels usually appear because of a misuse of pandas' functions in the first place.

Take the following food grading dataframe as an example

import pandas as pd
from random import randint

df = pd.DataFrame({
    'food_id': [*list(range(5)), *list(range(5))],
    'grade': [randint(0, 10) for _ in range(10)],
}, index=pd.Index([*[10 for _ in range(5)], *[20 for _ in range(5)]], name='person_id'))
print(df)

            food_id  grade
person_id                
10                0      2
10                1      5
10                2      2
10                3      6
10                4      0
20                0      4
20                1     10
20                2      2
20                3      3
20                4      1

If I wanted to find the best-graded food for each person, I could group by person_id and take the row with the largest grade.

print(df.groupby(level=0).apply(lambda g: g.nlargest(1, 'grade')))
                      food_id  grade
person_id  person_id                
10         10                3      6
20         20                1     10

but I end up with a duplicated person_id level, since apply returns each row with its original index and groupby prepends the group key on top of it. If I set as_index=False on the groupby call, the group keys are replaced by a plain integer index, but I still get two levels.

print(df.groupby(level=0, as_index=False).apply(lambda g: g.nlargest(1, 'grade')))
             food_id  grade
  person_id                
0 10               3      6
1 20               1     10

Resetting the index inside the apply yields a similar result.

print(df.groupby(level=0).apply(lambda g: g.nlargest(1, 'grade').reset_index(drop=True)))
              food_id  grade
person_id                   
10        0         3      6
20        0         1     10

The solution here is to set group_keys=False on groupby: the result then keeps the index of the rows returned by apply instead of prepending the group keys as an extra level.

print(df.groupby(level=0, group_keys=False).apply(lambda g: g.nlargest(1, 'grade')))
           food_id  grade
person_id                
10               3      6
20               1     10

If food_id is part of the index, we can skip the apply entirely and use the nlargest method of SeriesGroupBy, like so.

print(df.set_index('food_id', append=True).groupby(level='person_id', group_keys=False)['grade'].nlargest(1))
person_id  food_id
10         3           6
20         1          10
Name: grade, dtype: int64
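
Another way to sidestep the duplication entirely, without apply, is to sort by grade and keep the last row of each group; a minimal sketch, assuming the same df as above:

# highest grade per person; the result keeps the original single-level person_id index
print(df.sort_values('grade').groupby(level=0).tail(1))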

What I'm getting at is that duplicated index levels are (mostly) the result of misusing pandas' operations, and the better solution is to adjust those operations so the duplication never arises, rather than letting it happen and then finding a fix.

