Drop Non-equivalent Multiindex Rows in Pandas Dataframe

Question

Goal

Drop a row if the min and max sub-columns differ under any of the top-level columns (ao, hia, cyp1a2s, cyp3a4s in this case). In other words, keep a row only if min equals max in every column, with NaN counted as equal to NaN.

Example

import numpy as np
import pandas as pd

arrays = [np.array(['ao', 'ao', 'hia', 'hia', 'cyp1a2s', 'cyp1a2s', 'cyp3a4s', 'cyp3a4s']),
          np.array(['min', 'max', 'min', 'max', 'min', 'max', 'min', 'max'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['',''])
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0],
                            [1, 1, 0, 0, float('nan'), 1, 0, 0],
                            [0, 2, 0, 0, float('nan'), float('nan'), 1, 1]]), index=['1', '2', '3'], columns=index)
df

    ao      hia     cyp1a2s cyp3a4s
    min max min max min max min max
1   1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0
2   1.0 1.0 0.0 0.0 NaN 1.0 0.0 0.0
3   0.0 2.0 0.0 0.0 NaN NaN 1.0 1.0

Want

df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0]]), index=['1'], columns=index)
df

    ao      hia     cyp1a2s cyp3a4s
    min max min max min max min max
1   1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0

Attempt

df.apply(lambda x: x['min'].map(str) == x['max'].map(str), axis=1)

KeyError: ('min', 'occurred at index 1')

Note

The actual dataframe has 50+ columns.

jezrael · Accepted Answer · 2020-11-25 05:43:50Z

2

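Use DataFrame.xs to select the min and max sub-columns from the second level of the column MultiIndex, and replace NaNs with a placeholder so that missing values compare as equal: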

df1 = df.xs('min', axis=1, level=1).fillna('nan')
df2 = df.xs('max', axis=1, level=1).fillna('nan')

Or convert data to strings:

df1 = df.xs('min', axis=1, level=1).astype('str')
df2 = df.xs('max', axis=1, level=1).astype('str')

Compare the two DataFrames with DataFrame.eq, test whether every value in a row is True with DataFrame.all, and finally filter by boolean indexing:

df = df[df1.eq(df2).all(axis=1)]
print (df)
    ao       hia      cyp1a2s     cyp3a4s     
   min  max  min  max     min max     min  max
1  1.0  1.0  0.0  0.0     NaN NaN     0.0  0.0
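
Putting the steps above together, a minimal self-contained sketch on the question's sample frame might look like this. The names mins, maxs, same and keep are illustrative and not from the answer. Instead of the fillna/astype trick, this variant marks "both values missing" as a match explicitly, on the assumption that NaN in both sub-columns should count as equal:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question.
columns = pd.MultiIndex.from_product(
    [['ao', 'hia', 'cyp1a2s', 'cyp3a4s'], ['min', 'max']]
)
df = pd.DataFrame(
    [[1, 1, 0, 0, np.nan, np.nan, 0, 0],
     [1, 1, 0, 0, np.nan, 1, 0, 0],
     [0, 2, 0, 0, np.nan, np.nan, 1, 1]],
    index=['1', '2', '3'],
    columns=columns,
)

# Select the 'min' and 'max' sub-columns; both results carry the same
# top-level column labels, so they align for an element-wise comparison.
mins = df.xs('min', axis=1, level=1)
maxs = df.xs('max', axis=1, level=1)

# eq() treats NaN as unequal to NaN, so count "both missing" as a match.
same = mins.eq(maxs) | (mins.isna() & maxs.isna())

# Keep only the rows where every top-level column matches.
keep = same.all(axis=1)
result = df[keep]
print(result)
```

Because the comparison runs over whole DataFrames, it should not need any per-column code, which matters for the 50+ column case mentioned in the question.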

answered Nov 25, 2020 at 5:43


1 Comment

June Over a year ago

It is very kind to explain the code (and with helpful links)! I wonder why df.apply will not work in this case.

frankr6591 · Accepted Answer · 2020-11-26 17:55:44Z

1

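df.apply() didn't work because each row carries two levels of column labels, and both must be referenced (for example r['ao']['min'] rather than r['min']).

Also, .map(str) is not available on the float64 scalar that such a lookup returns, so I used .astype(str) instead.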
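
As a rough illustration of the first point, the sketch below uses a reduced two-column version of the question's frame. With axis=1, apply hands each row over as a Series indexed by the full two-level column MultiIndex, so 'min' on its own is looked up in the first level and fails. The names row, raised and row_matches are hypothetical, and the NaN handling is an assumption about the desired behavior:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([['ao', 'cyp1a2s'], ['min', 'max']])
df = pd.DataFrame(
    [[1, 1, np.nan, np.nan],
     [0, 2, np.nan, np.nan]],
    index=['1', '3'],
    columns=columns,
)

row = df.iloc[0]            # the kind of object apply(axis=1) passes in
levels = row.index.nlevels  # the row keeps both column levels

# 'min' is not a label in the FIRST level, so this lookup raises KeyError.
try:
    row['min']
    raised = False
except KeyError:
    raised = True

def row_matches(r):
    # Select on the second level to get all 'min' / 'max' values at once.
    lo = r.xs('min', level=1)
    hi = r.xs('max', level=1)
    # Count "both missing" as a match, since NaN == NaN is False.
    return bool(((lo == hi) | (lo.isna() & hi.isna())).all())

mask = df.apply(row_matches, axis=1)
print(df[mask])
```

Selecting with xs inside the function avoids naming each top-level column, though it still runs once per row.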

The following works for more than one column:

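eqCols = ['cyp1a2s', 'hia']  # columns where min must equal max
# all remaining top-level columns: min must differ from max
neqCols = list(set(df.xs('min', level=1, axis=1).columns) - set(eqCols))
# compare as strings so that NaN matches NaN
EQ = lambda r, c: r[c]['min'].astype(str) == r[c]['max'].astype(str)
# check every listed column, not only the first one in each list
df[df.apply(lambda r: all(EQ(r, c) for c in eqCols) and all(not EQ(r, c) for c in neqCols), axis=1)]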

edited Nov 26, 2020 at 17:55

answered Nov 25, 2020 at 14:02


2 Comments

June Over a year ago

Hi, I have 50+ columns. Repeating for each column would be time-consuming.

frankr6591 Over a year ago

See latest changes... divided into Equality columns and NotEqual columns; ultimately doing the same # of comparisons.. not sure which is faster df.apply() or df.eq()?
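
On the speed question, one way to find out is to time both styles on synthetic data. The sketch below is only a rough harness under stated assumptions: the column names c0..c49, the frame size, and the helper names vectorized and row_wise are all made up for illustration, and both helpers implement the "min equals max in every column" check rather than the equal/not-equal split above. Vectorized comparisons usually scale better than row-wise apply in pandas, but the numbers depend on the data and the pandas version, so measure on your own frame:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
names = [f'c{i}' for i in range(50)]  # hypothetical top-level columns
columns = pd.MultiIndex.from_product([names, ['min', 'max']])
data = rng.integers(0, 2, size=(500, 100)).astype(float)
# Force the first 10 rows to have max == min in every column.
data[:10, 1::2] = data[:10, 0::2]
df = pd.DataFrame(data, columns=columns)

def vectorized(frame):
    lo = frame.xs('min', axis=1, level=1)
    hi = frame.xs('max', axis=1, level=1)
    return (lo.eq(hi) | (lo.isna() & hi.isna())).all(axis=1)

def row_wise(frame):
    def check(r):
        lo = r.xs('min', level=1)
        hi = r.xs('max', level=1)
        return bool(((lo == hi) | (lo.isna() & hi.isna())).all())
    return frame.apply(check, axis=1)

# Both styles should agree before their timings are compared.
agree = vectorized(df).equals(row_wise(df))

t_vec = timeit.timeit(lambda: vectorized(df), number=3)
t_row = timeit.timeit(lambda: row_wise(df), number=3)
print(f'vectorized: {t_vec:.3f}s, row-wise apply: {t_row:.3f}s')
```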
