1

I have a DataFrame, where all columns are in binary form, i.e; taking either 0 or 1 as value as shown below.

import pandas as pd
df = pd.DataFrame({'qualified_exam':[1,1,0,0,1,1],'gender_M':[1,0,1,0,1,1],'employed':[1,0,0,1,1,1],'married':[0,1,0,1,1,0]})
print(df)
      qualified_exam  gender_M  employed  married
   0               1         1         1        0
   1               1         0         0        1
   2               0         1         0        0
   3               0         0         1        1
   4               1         1         1        1
   5               1         1         1        0

I want to create a DataFrame where I want to measure sum/mean of column qualified _exam by grouping by all the remaining 3 columns individually - gender_M, employed, married.

Final DataFrame should look something like this -

            sum_0   sum_1    mean_0    mean_1
gender_M        1       3      0.50      0.75
employed        1       3      0.50      0.75
 married        2       2      0.66      0.66

I tried it by doing groupby() and agg() for each of the 3 columns individually, and then appending the result one by one. This is too cumbersome. I am sure there is a better way.

1 Answer 1

2

Use DataFrame.melt with aggregate sum with mean, reshape by DataFrame.unstack and last flatten MultiIndex in columns:

df1 = (df.melt('qualified_exam')
         .groupby(['variable', 'value'])['qualified_exam']
         .agg(['sum','mean'])
         .unstack()

         )
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
          sum_0  sum_1    mean_0    mean_1
variable                                  
employed      1      3  0.500000  0.750000
gender_M      1      3  0.500000  0.750000
married       2      2  0.666667  0.666667

Or use DataFrame.pivot_table:

df1 = (df.melt('qualified_exam')
         .pivot_table(index='variable', 
                      columns='value', 
                      values='qualified_exam', 
                      aggfunc=('sum','mean')))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df1)
            mean_0    mean_1  sum_0  sum_1
variable                                  
employed  0.500000  0.750000    1.0    3.0
gender_M  0.500000  0.750000    1.0    3.0
married   0.666667  0.666667    2.0    2.0
Sign up to request clarification or add additional context in comments.

13 Comments

Hallo, Many thanks for answer. This code fails when any column has only 0s or 1s. For eg; Just change 'gender_M':[0,0,0,0,0,0] and the result wll be 0,0,0,0.
Yes, I at the moment I need these 4 columns - ´sum_0, sum_1, mean_0, mean_1´. But, I may extend it to include 'count' as well.
Sorry, I did not understand - 'Is possible specify 0,1 ? '.
@Let me do a detailed check to be sure that the program is behaving as expected. I'll be right back.
Hi Jazreal, I took a close look. Actually, the result is not correct. Have a look at the top, the result is married 1 2 0.5 0.50, where as expected one was married 2 2 0.66 0.66.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.