
I have a dataframe that is 762106 rows x 79 columns. There are 14 'sets' of three columns, with each column in a set indicating a different level of 'intensity' for a given feature, and NaN wherever the value sits in another column of the set. They are already encoded, and I want to condense each set into a single column so that instead of 42 of these columns I have 14.

A subset can be recreated like this:

import pandas as pd
import numpy as np    
df = pd.DataFrame([[np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan, 3],
                    [1, np.nan, np.nan, np.nan, 2, np.nan, 1, np.nan, np.nan],
                    [np.nan, np.nan, 3, 1, np.nan, np.nan, np.nan, 2, np.nan]],
                   columns=['a','aa','aaa','b','bb','bbb','c','cc','ccc'])

Output:

    a       aa      aaa     b       bb      bbb     c       cc      ccc
0   NaN     2.0     NaN     1.0     NaN     NaN     NaN     NaN     3.0
1   1.0     NaN     NaN     NaN     2.0     NaN     1.0     NaN     NaN
2   NaN     NaN     3.0     1.0     NaN     NaN     NaN     2.0     NaN

I want them to look like this:

    a   b   c
0   2   1   3
1   1   2   1
2   3   1   2

My current solution is to pull values from aa, aaa, etc. into the base column using .fillna(), and then use .drop() to drop the superfluous columns:

df['a'] = df['a'].fillna(df['aa']).fillna(df['aaa'])
df = df.drop(['aa', 'aaa'], axis=1)

df['b'] = df['b'].fillna(df['bb']).fillna(df['bbb'])
df = df.drop(['bb', 'bbb'], axis=1)

And this works, but I want to know if there is a more elegant way to accomplish this without copy-pasting this code block 14 times.
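For reference, the repeated block above can be folded into a loop. This is a sketch on the sample data, assuming every set follows the a/aa/aaa naming pattern (string repetition gives the variant names); with the real 42 columns you would loop over the actual base names instead:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[np.nan, 2, np.nan, 1, np.nan, np.nan, np.nan, np.nan, 3],
                   [1, np.nan, np.nan, np.nan, 2, np.nan, 1, np.nan, np.nan],
                   [np.nan, np.nan, 3, 1, np.nan, np.nan, np.nan, 2, np.nan]],
                  columns=['a', 'aa', 'aaa', 'b', 'bb', 'bbb', 'c', 'cc', 'ccc'])

# One pass per feature set: backfill the base column from its doubled and
# tripled variants, then drop those variants.
for base in ['a', 'b', 'c']:
    df[base] = df[base].fillna(df[base * 2]).fillna(df[base * 3])
    df = df.drop([base * 2, base * 3], axis=1)

print(df)
```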

2 Answers


You can use pandas.DataFrame.groupby with axis = 1 ("columns"):

df.groupby(lambda x: x[0], axis=1).sum()
     a    b    c
0  2.0  1.0  3.0
1  1.0  2.0  1.0
2  3.0  1.0  2.0

If groupby is given a function, it is called on each value of the object's index, in this case the column names.

Since you can group by any function, this is a really flexible solution.
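For instance, with a hypothetical prefix_suffix naming scheme you could group on whatever the function returns, here the part before the underscore. (Transposing first sidesteps the axis=1 keyword, which is deprecated in recent pandas versions.)

```python
import pandas as pd
import numpy as np

# Hypothetical columns: feature name plus an intensity suffix.
df2 = pd.DataFrame([[np.nan, 2.0],
                    [1.0, np.nan]],
                   columns=['a_low', 'a_high'])

# Group on the prefix before the underscore; transposing makes the
# former column labels the index that the function is applied to.
out = df2.T.groupby(lambda c: c.split('_')[0]).sum().T
print(out)
#      a
# 0  2.0
# 1  1.0
```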


You can group by the first letter of the column names and aggregate with GroupBy.first:

df = df.groupby(df.columns.str[0], axis=1).first()
print (df)
     a    b    c
0  2.0  1.0  3.0
1  1.0  2.0  1.0
2  3.0  1.0  2.0
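One caveat worth noting (a small sketch with hypothetical data): first() takes the first non-NaN value in each group, so a row where an entire set is NaN stays NaN, whereas sum() would turn it into 0. The transposed form below is equivalent and avoids the deprecated axis=1 keyword:

```python
import pandas as pd
import numpy as np

tiny = pd.DataFrame([[np.nan, np.nan],
                     [1.0, np.nan]],
                    columns=['a', 'aa'])

# Group the former column labels by their first letter; first() skips
# NaN and leaves NaN only when the whole group is NaN.
res = tiny.T.groupby(tiny.columns.str[0]).first().T
print(res)
#      a
# 0  NaN
# 1  1.0
```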
