Goal

Keep a row only if the min sub-column equals the max sub-column in every top-level column (ao, hia, cyp1a2s, cyp3a4s in this case). If min and max differ in any column, drop the row. As the wanted output shows, a NaN/NaN pair counts as equal.

Example

import numpy as np
import pandas as pd

arrays = [np.array(['ao', 'ao', 'hia', 'hia', 'cyp1a2s', 'cyp1a2s', 'cyp3a4s', 'cyp3a4s']),
          np.array(['min', 'max', 'min', 'max', 'min', 'max', 'min', 'max'])]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['',''])
df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0], 
                            [1, 1, 0, 0, float('nan'), 1, 0, 0],
                            [0, 2, 0, 0, float('nan'), float('nan'), 1, 1],]), index=['1', '2', '3'], columns=index)
df

    ao      hia     cyp1a2s cyp3a4s
    min max min max min max min max
1   1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0
2   1.0 1.0 0.0 0.0 NaN 1.0 0.0 0.0
3   0.0 2.0 0.0 0.0 NaN NaN 1.0 1.0

Want

df = pd.DataFrame(np.array([[1, 1, 0, 0, float('nan'), float('nan'), 0, 0]]), index=['1'], columns=index)
df

    ao      hia     cyp1a2s cyp3a4s
    min max min max min max min max
1   1.0 1.0 0.0 0.0 NaN NaN 0.0 0.0

Attempt

df.apply(lambda x: x['min'].map(str) == x['max'].map(str), axis=1)

KeyError: ('min', 'occurred at index 1')

Note

The actual dataframe has 50+ columns.

2 Answers

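Use DataFrame.xs to select the 'min' and the 'max' sub-columns from the second level of the MultiIndex, and replace NaNs with a placeholder so that two missing values compare as equal: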

df1 = df.xs('min', axis=1, level=1).fillna('nan')
df2 = df.xs('max', axis=1, level=1).fillna('nan')

Or convert data to strings:

df1 = df.xs('min', axis=1, level=1).astype('str')
df2 = df.xs('max', axis=1, level=1).astype('str')

Compare the two DataFrames with DataFrame.eq, test whether all values in each row are True with DataFrame.all, and finally filter with boolean indexing:

df = df[df1.eq(df2).all(axis=1)]
print(df)
    ao       hia      cyp1a2s     cyp3a4s     
   min  max  min  max     min max     min  max
1  1.0  1.0  0.0  0.0     NaN NaN     0.0  0.0
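
For reference, the steps above can be put together as one self-contained sketch. It is a variation rather than the exact code from this answer: instead of filling NaNs or converting to strings, it treats a NaN/NaN pair as equal with an explicit isna() check, which keeps the numeric dtype. The helper name rows_with_equal_min_max is made up for illustration.

```python
import pandas as pd

# Rebuild the example frame from the question.
columns = pd.MultiIndex.from_product(
    [["ao", "hia", "cyp1a2s", "cyp3a4s"], ["min", "max"]]
)
nan = float("nan")
df = pd.DataFrame(
    [
        [1, 1, 0, 0, nan, nan, 0, 0],
        [1, 1, 0, 0, nan, 1, 0, 0],
        [0, 2, 0, 0, nan, nan, 1, 1],
    ],
    index=["1", "2", "3"],
    columns=columns,
)


def rows_with_equal_min_max(frame):
    """Keep rows whose 'min' equals 'max' in every top-level column.

    Hypothetical helper name, used only for this illustration.
    """
    lo = frame.xs("min", axis=1, level=1)
    hi = frame.xs("max", axis=1, level=1)
    # Equal values, or both missing. NaN == NaN is False, so the
    # both-missing case has to be handled explicitly.
    same = lo.eq(hi) | (lo.isna() & hi.isna())
    return frame[same.all(axis=1)]


result = rows_with_equal_min_max(df)
print(result.index.tolist())  # ['1']
```

Because nothing in the helper names a specific column, it should work unchanged for a frame with 50+ top-level columns, as long as each one has a 'min' and a 'max' sub-column.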

1 Comment

It is very kind to explain the code (and with helpful links)! I wonder why df.apply will not work in this case.

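df.apply() didn't work because, with axis=1, each row is passed as a Series indexed by both column levels, so you need to reference both levels (for example r['ao']['min']). 'min' on its own is not a first-level key, hence the KeyError.

Also, r[c]['min'] is a scalar float64, which has no .map method, so .astype(str) is used instead.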
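
A small sketch of that failure, and of one way to make a row-wise apply work, assuming a cut-down two-column version of the question's frame. The function name min_equals_max is hypothetical, and the NaN handling via isna() is a choice made for this example, not part of the code below it.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["ao", "hia"], ["min", "max"]])
df = pd.DataFrame(
    [[1.0, 1.0, np.nan, np.nan], [0.0, 2.0, 0.0, 0.0]],
    index=["1", "2"],
    columns=columns,
)

row = df.iloc[0]  # the kind of object apply(..., axis=1) passes in

# 'min' lives in the second level, so a plain label lookup searches
# the first level ('ao', 'hia') and fails.
try:
    row["min"]
    lookup_failed = False
except KeyError:
    lookup_failed = True


def min_equals_max(r):
    """Row-wise check; hypothetical name used for illustration."""
    lo = r.xs("min", level=1)  # select by second level on the row
    hi = r.xs("max", level=1)
    # Treat a NaN/NaN pair as equal.
    return bool((lo.eq(hi) | (lo.isna() & hi.isna())).all())


mask = df.apply(min_equals_max, axis=1)
print(lookup_failed)            # True
print(df[mask].index.tolist())  # ['1']
```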

The following works for more than one column:

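eqCols = ['cyp1a2s', 'hia']  # columns where min must equal max
# every remaining top-level column: min must differ from max
neqCols = list(set(df.xs('min', level=1, axis=1).columns) - set(eqCols))
# compare string forms so that NaN equals NaN
EQ = lambda r, c: r[c]['min'].astype(str) == r[c]['max'].astype(str)
# all() checks every listed column, not only the first one
df[df.apply(lambda r: all(EQ(r, c) for c in eqCols) and all(not EQ(r, c) for c in neqCols), axis=1)]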

2 Comments

Hi, I have 50+ columns. Repeating for each column would be time-consuming.
See the latest changes: the columns are now divided into equality columns and not-equal columns, so the number of comparisons ends up the same. I am not sure which is faster, df.apply() or df.eq().
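
One way to settle the speed question is to time both approaches on a frame of similar shape. The sketch below is only an assumption-laden benchmark: the synthetic data, the column names col0..col49, and the function names are all invented for illustration, and no timings are claimed here. Row-wise apply is generally expected to be slower than vectorized comparison, but measuring on the real data is the only reliable check.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
names = [f"col{i}" for i in range(50)]  # hypothetical column names
columns = pd.MultiIndex.from_product([names, ["min", "max"]])
# Synthetic 0/1 data, 200 rows by 50 min/max pairs.
df = pd.DataFrame(
    rng.integers(0, 2, size=(200, 100)).astype(float), columns=columns
)


def vectorized(frame):
    """Whole-frame comparison with xs/eq/all."""
    lo = frame.xs("min", axis=1, level=1)
    hi = frame.xs("max", axis=1, level=1)
    return (lo.eq(hi) | (lo.isna() & hi.isna())).all(axis=1)


def row_wise(frame):
    """Same check, one row at a time via apply."""
    def check(r):
        lo = r.xs("min", level=1)
        hi = r.xs("max", level=1)
        return bool((lo.eq(hi) | (lo.isna() & hi.isna())).all())

    return frame.apply(check, axis=1)


# Both approaches should agree before their timings are compared.
same_result = vectorized(df).tolist() == row_wise(df).tolist()

t_vec = timeit.timeit(lambda: vectorized(df), number=3)
t_row = timeit.timeit(lambda: row_wise(df), number=3)
print(f"same result: {same_result}")
print(f"vectorized: {t_vec:.4f}s, row-wise apply: {t_row:.4f}s")
```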
