Basically, I want to turn:

    Date    0       1       2
0   10-1    thing1  None    None
1   10-1    thing1  thing1  None
2   10-2    thing2  thing1  None
3   10-3    thing1  thing1  thing2

into a groupby:

    Date
    10-1  thing1    3
    10-2  thing1    1
          thing2    1
    10-3  thing1    2
          thing2    1

Details: I have a complicated "object" column from a JSON import. It's a list of dicts, each of which contains another list with the contents I'm interested in. I've managed both to "flatten" this final list into separate columns (0, 1, 2 above) and to extract the list itself into a single column (i.e. [0, 1, 2]). The elements of these columns are all the same categorical values (thing1, thing2, etc.)

I could imagine creating new rows for each of the 1 and 2 columns, storing their values in the 0 column, but if these values can be aggregated and grouped directly, that would be great.
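For reference, the example frame above can be built like this (an assumption: the flattened columns are literally named 0, 1, 2, and the missing entries are real None rather than the string 'None'):

import pandas as pd

# Minimal reconstruction of the example; the column names and None
# placeholders are assumptions, not the actual JSON-derived frame.
df = pd.DataFrame({
    'Date': ['10-1', '10-1', '10-2', '10-3'],
    0: ['thing1', 'thing1', 'thing2', 'thing1'],
    1: [None, 'thing1', 'thing1', 'thing1'],
    2: [None, None, None, 'thing2'],
})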

4 Answers


I would use get_dummies, since it also adds missing levels, like thing2 for 10-1:

import pandas as pd
import numpy as np

# One-hot encode the value columns, stack to long form, then count per
# (Date, value) pair; replace() clears the 'None' placeholders first.
pd.get_dummies(df.set_index('Date').replace('None', np.nan),
               prefix='', prefix_sep='').stack().sum(level=[0, 1])
# (On pandas >= 2.0, sum(level=...) is gone: use .groupby(level=[0, 1]).sum())
Out[185]: 
Date        
10-1  thing1    3
      thing2    0
10-2  thing1    1
      thing2    1
10-3  thing1    2
      thing2    1
dtype: uint8

2 Comments

This is really cool. I couldn't help adding another solution that works similarly but has a column each for thing1 and thing2.
I'm going to accept this answer, as it's the one that worked best for me. Out of curiosity: is there a way to do this on a much more complex data frame, selecting only certain columns for the get_dummies? Or are you obligated to create a new DF with only the categorical columns? Thanks!
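For the column-selection question above: get_dummies accepts a columns= argument, so a subset can be encoded in place without building a separate frame. A minimal sketch, where cat_cols is a hypothetical stand-in for the real categorical column names:

# cat_cols is hypothetical -- list only the columns to encode;
# everything else (e.g. 'Date') passes through untouched.
cat_cols = [0, 1, 2]
pd.get_dummies(df, columns=cat_cols, prefix='', prefix_sep='')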

There has got to be a better way, but this is what came to mind:

# Count each value per column within each date group, then add the
# per-column counts across the row to get one total per value.
(df.groupby('Date')
   .apply(lambda x: x.drop('Date', axis=1).apply(lambda y: y.value_counts()))
   .sum(axis=1)
   .astype(int))

Date        
10-1  thing1    3
10-2  thing1    1
      thing2    1
10-3  thing1    2
      thing2    1
dtype: int64


0

This works for me:

df.melt(id_vars='Date').groupby('Date')['value'].value_counts()

output:

Date  value 
10-1  thing1    3
10-2  thing1    1
      thing2    1
10-3  thing1    2
      thing2    1

Explanation: melt puts all the values from your three value columns into a single column, while keeping the date for each value. We then group by date and count the values.
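For reference, here is the intermediate frame melt produces on the reconstructed example above (first five rows; the variable column holds the original column name and is ignored by the groupby):

df.melt(id_vars='Date').head()

   Date variable   value
0  10-1        0  thing1
1  10-1        0  thing1
2  10-2        0  thing2
3  10-3        0  thing1
4  10-1        1    None

Note that value_counts only drops real None/NaN entries; if your placeholders are the string 'None', replace them first, as in the accepted answer.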

By the way, the example above returns a series with a multi-index of Date and value. If you want a dataframe you can use:

df.melt(id_vars='Date').groupby('Date').agg({'value':'value_counts'})

This returns an actual dataframe with the same structure, so it still has a multi-index with levels Date and value.
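If you'd rather have flat columns than a multi-index at all, reset_index flattens either form; a small sketch, where 'count' is just an arbitrary name for the values column:

counts = df.melt(id_vars='Date').groupby('Date')['value'].value_counts()
counts.reset_index(name='count')   # columns: Date, value, count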



Ok, here's yet another answer. This one uses get_dummies because I like that particular solution. But this time I'm going to make columns with counts for thing1 and thing2:

# Encode every column except Date, merge the duplicate dummy columns
# (one per source column) by name, then total the counts for each date.
(pd.get_dummies(df, columns=df.columns[1:], prefix="", prefix_sep="")
   .groupby(axis=1, level=0).sum()
   .groupby('Date').sum())

The result is:

      thing1  thing2
Date                
10-1       3       0
10-2       1       1
10-3       2       1

I just thought this was cool enough that I wanted to share it here :)
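One caveat: groupby(..., axis=1) is deprecated in recent pandas. A sketch of an equivalent that avoids it, assuming the missing entries are real None/NaN (otherwise replace the 'None' strings first, as in the accepted answer):

# Stack the value columns into one long series, then one-hot encode it;
# stack() drops the None/NaN entries along the way.
pd.get_dummies(df.set_index('Date').stack()).groupby(level=0).sum()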

1 Comment

This didn't seem to work for me... it created columns for each date, so it couldn't group on the 'Date' column. It's possible I did something incorrectly, though; my actual data frame is much more complicated, so I extracted the categorical columns (and date) and tried it on that (see comment below)
