Creating a DataFrame by aggregating by each column

Question

I have a DataFrame in which all columns are binary, i.e. each takes either 0 or 1 as its value, as shown below.

import pandas as pd
df = pd.DataFrame({'qualified_exam':[1,1,0,0,1,1],'gender_M':[1,0,1,0,1,1],'employed':[1,0,0,1,1,1],'married':[0,1,0,1,1,0]})
print(df)
      qualified_exam  gender_M  employed  married
   0               1         1         1        0
   1               1         0         0        1
   2               0         1         0        0
   3               0         0         1        1
   4               1         1         1        1
   5               1         1         1        0

I want to create a DataFrame that measures the sum and mean of the column qualified_exam, grouped by each of the remaining 3 columns individually: gender_M, employed, married.

The final DataFrame should look something like this:

            sum_0   sum_1    mean_0    mean_1
gender_M        1       3      0.50      0.75
employed        1       3      0.50      0.75
married         2       2      0.66      0.66

I tried doing groupby() and agg() for each of the 3 columns individually and then appending the results one by one. This is too cumbersome, and I am sure there is a better way.

jezrael · Accepted Answer · 2020-12-03 11:56:56Z

Use DataFrame.melt, aggregate with sum and mean, reshape with DataFrame.unstack, and finally flatten the MultiIndex in the columns:

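# melt to long format: one row per (original column, its 0/1 value)
df1 = (df.melt('qualified_exam')
         .groupby(['variable', 'value'])['qualified_exam']
         .agg(['sum', 'mean'])
         .unstack())  # move the 0/1 values into the columns
# flatten the (statistic, value) MultiIndex into names such as sum_0
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)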
          sum_0  sum_1    mean_0    mean_1
variable                                  
employed      1      3  0.500000  0.750000
gender_M      1      3  0.500000  0.750000
married       2      2  0.666667  0.666667
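
To see why this works, it can help to look at the intermediate long-format frame that melt produces. A minimal sketch using the question's data (the name long_df is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'qualified_exam': [1, 1, 0, 0, 1, 1],
    'gender_M': [1, 0, 1, 0, 1, 1],
    'employed': [1, 0, 0, 1, 1, 1],
    'married': [0, 1, 0, 1, 1, 0],
})

# Keep 'qualified_exam' as the identifier column; every other column is
# stacked into (variable, value) pairs, one row per original cell.
long_df = df.melt('qualified_exam')

print(long_df.shape)  # (18, 3): 6 rows x 3 melted columns
print(long_df.head())
```

Each row now carries the original column name in 'variable' and its 0/1 entry in 'value', so a single groupby on both covers all three columns at once.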

Or use DataFrame.pivot_table:

# pivot_table performs the grouping and the reshape in one call
df1 = (df.melt('qualified_exam')
         .pivot_table(index='variable',
                      columns='value',
                      values='qualified_exam',
                      aggfunc=('sum','mean')))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)
            mean_0    mean_1  sum_0  sum_1
variable                                  
employed  0.500000  0.750000    1.0    3.0
gender_M  0.500000  0.750000    1.0    3.0
married   0.666667  0.666667    2.0    2.0
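
As a sanity check, the same numbers can be reproduced with the per-column groupby that the question describes. This is only a sketch; summarize is a hypothetical helper, not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    'qualified_exam': [1, 1, 0, 0, 1, 1],
    'gender_M': [1, 0, 1, 0, 1, 1],
    'employed': [1, 0, 0, 1, 1, 1],
    'married': [0, 1, 0, 1, 1, 0],
})

def summarize(frame, target, by):
    """Hypothetical helper: one row of sum_/mean_ values for one grouping column."""
    # DataFrame.unstack on a single-level index returns a Series
    # indexed by (statistic, group value).
    stats = frame.groupby(by)[target].agg(['sum', 'mean']).unstack()
    stats.index = [f'{stat}_{val}' for stat, val in stats.index]
    return stats.rename(by)

# One Series per grouping column, stacked as the rows of a new DataFrame.
check = pd.DataFrame([summarize(df, 'qualified_exam', col)
                      for col in ['gender_M', 'employed', 'married']])
print(check)
```

If the rows agree with df1 above, the melt-based reshaping is doing the same work as the loop, just in one pass.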

edited Dec 3, 2020 at 11:56

answered Dec 3, 2020 at 10:56

jezrael


13 Comments

cph_sto Over a year ago

Hello, many thanks for the answer. This code fails when any column has only 0s or only 1s. For example, just change 'gender_M' to [0,0,0,0,0,0] and the result will be 0,0,0,0.
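
The constant-column case can be probed by rerunning the melt pipeline from the answer on such a frame. In this sketch, a value that never occurs produces missing entries (NaN) for its columns; the reindex step is an assumption about the desired layout (always both 0 and 1 for each statistic), not something stated in the thread:

```python
import pandas as pd

df = pd.DataFrame({
    'qualified_exam': [1, 1, 0, 0, 1, 1],
    'gender_M': [0, 0, 0, 0, 0, 0],  # constant column, as in the comment
    'employed': [1, 0, 0, 1, 1, 1],
    'married': [0, 1, 0, 1, 1, 0],
})

wide = (df.melt('qualified_exam')
          .groupby(['variable', 'value'])['qualified_exam']
          .agg(['sum', 'mean'])
          .unstack())

# Assumption: both 0 and 1 should always appear for each statistic, so force
# the full set of (statistic, value) columns even if a value never occurs.
full_cols = pd.MultiIndex.from_product([['sum', 'mean'], [0, 1]])
wide = wide.reindex(columns=full_cols)
wide.columns = [f'{stat}_{val}' for stat, val in wide.columns]
print(wide)
```

Whether the missing entries should stay as NaN or be filled (for example with fillna(0)) depends on how the table is used downstream.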

cph_sto Over a year ago

Yes, at the moment I need these 4 columns: 'sum_0', 'sum_1', 'mean_0', 'mean_1'. But I may extend it to include 'count' as well.

cph_sto Over a year ago

Sorry, I did not understand 'Is possible specify 0,1 ?'.

cph_sto Over a year ago

Let me do a detailed check to be sure that the program is behaving as expected. I'll be right back.

cph_sto Over a year ago

Hi jezrael, I took a close look. Actually, the result is not correct. Have a look at the top: the result is married 1 2 0.5 0.50, whereas the expected one was married 2 2 0.66 0.66.
