Basically, I want to turn:

    Date    0       1       2
0   10-1    thing1  None    None
1   10-1    thing1  thing1  None
2   10-2    thing2  thing1  None
3   10-3    thing1  thing1  thing2

into a groupby:

    Date
    10-1  thing1    3
    10-2  thing1    1
          thing2    1
    10-3  thing1    2
          thing2    1

Details: I have a complicated "object" column from a JSON import. It's a list of dicts, each of which contains another list with the contents I'm interested in. I've managed both to "flatten" this final list into separate columns (0, 1, 2 above) and to extract the list itself into a single column (i.e. [0, 1, 2]). The elements of these columns are all the same categorical values (thing1, thing2, etc.)

I could imagine creating new rows for each of the 1 and 2 columns, storing their values in the 0 column, but if these values can be aggregated and grouped directly, that would be great.
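For reference, the example frame above can be built like this (an assumption: the flattened columns are literally named 0, 1, 2, and the missing entries are real None rather than the string 'None'):

import pandas as pd

# Minimal reconstruction of the example; the column names and None
# placeholders are assumptions, not the actual JSON-derived frame.
df = pd.DataFrame({
    'Date': ['10-1', '10-1', '10-2', '10-3'],
    0: ['thing1', 'thing1', 'thing2', 'thing1'],
    1: [None, 'thing1', 'thing1', 'thing1'],
    2: [None, None, None, 'thing2'],
})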

4 Answers


I would use get_dummies, since it also adds missing levels, like thing2 for 10-1:

import pandas as pd
import numpy as np

# One-hot encode the value columns, stack to long form, then count per
# (Date, value) pair; replace() clears the 'None' placeholders first.
pd.get_dummies(df.set_index('Date').replace('None', np.nan),
               prefix='', prefix_sep='').stack().sum(level=[0, 1])
# (On pandas >= 2.0, sum(level=...) is gone: use .groupby(level=[0, 1]).sum())
Out[185]: 
Date        
10-1  thing1    3
      thing2    0
10-2  thing1    1
      thing2    1
10-3  thing1    2
      thing2    1
dtype: uint8

2 Comments

This is really cool. I couldn't help adding another solution that works similarly but has a column each for thing1 and thing2.
I'm going to accept this answer, as it's the one that worked best for me. Out of curiosity: is there a way to do this on a much more complex data frame, selecting only certain columns for the get_dummies? Or are you obligated to create a new DF with only the categorical columns? Thanks!
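For the column-selection question above: get_dummies accepts a columns= argument, so a subset can be encoded in place without building a separate frame. A minimal sketch, where cat_cols is a hypothetical stand-in for the real categorical column names:

# cat_cols is hypothetical -- list only the columns to encode;
# everything else (e.g. 'Date') passes through untouched.
cat_cols = [0, 1, 2]
pd.get_dummies(df, columns=cat_cols, prefix='', prefix_sep='')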

There has got to be a better way, but this is what came to mind:

# Count each value per column within each date group, then add the
# per-column counts across the row to get one total per value.
(df.groupby('Date')
   .apply(lambda x: x.drop('Date', axis=1).apply(lambda y: y.value_counts()))
   .sum(axis=1)
   .astype(int))

Date        
10-1  thing1    3
10-2  thing1    1
      thing2    1
10-3  thing1    2
      thing2    1
dtype: int64


0

This works for me:

df.melt(id_vars='Date').groupby('Date')['value'].value_counts()

output:

Date  value 
10-1  thing1    3
10-2  thing1    1
      thing2    1
10-3  thing1    2
      thing2    1

Explanation: melt puts all the values from your three value columns into a single column, while keeping the date for each value. We then group by date and count the values.
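For reference, here is the intermediate frame melt produces on the reconstructed example above (first five rows; the variable column holds the original column name and is ignored by the groupby):

df.melt(id_vars='Date').head()

   Date variable   value
0  10-1        0  thing1
1  10-1        0  thing1
2  10-2        0  thing2
3  10-3        0  thing1
4  10-1        1    None

Note that value_counts only drops real None/NaN entries; if your placeholders are the string 'None', replace them first, as in the accepted answer.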

By the way, the example above returns a series with a multi-index of Date and value. If you want a dataframe you can use:

df.melt(id_vars='Date').groupby('Date').agg({'value':'value_counts'})

This returns an actual dataframe with the same structure, so it still has a multi-index with levels Date and value.
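If you'd rather have flat columns than a multi-index at all, reset_index flattens either form; a small sketch, where 'count' is just an arbitrary name for the values column:

counts = df.melt(id_vars='Date').groupby('Date')['value'].value_counts()
counts.reset_index(name='count')   # columns: Date, value, count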



Ok, here's yet another answer. This one uses get_dummies because I like that particular solution. But this time I'm going to make columns with counts for thing1 and thing2:

# Encode every column except Date, merge the duplicate dummy columns
# (one per source column) by name, then total the counts for each date.
(pd.get_dummies(df, columns=df.columns[1:], prefix="", prefix_sep="")
   .groupby(axis=1, level=0).sum()
   .groupby('Date').sum())

The result is:

      thing1  thing2
Date                
10-1       3       0
10-2       1       1
10-3       2       1

I just thought this was cool enough that I wanted to share it here :)
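One caveat: groupby(..., axis=1) is deprecated in recent pandas. A sketch of an equivalent that avoids it, assuming the missing entries are real None/NaN (otherwise replace the 'None' strings first, as in the accepted answer):

# Stack the value columns into one long series, then one-hot encode it;
# stack() drops the None/NaN entries along the way.
pd.get_dummies(df.set_index('Date').stack()).groupby(level=0).sum()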

1 Comment

This didn't seem to work for me... it created columns for each date, so it couldn't group on the 'Date' column. It's possible I did something incorrectly, though; my actual data frame is much more complicated, so I extracted the categorical columns (and date) and tried it on that (see comment below)
