pandas - groupby column with multiple values

Question

For each value, I want to list the users that have used it.

import pandas as pd
user = ['alice', 'bob', 'tim', 'alice']
val = [['a','b','c'],['a'],['c','d'],['a','d']]
df = pd.DataFrame({'user': user, 'val': val})

user     val
alice    [a, b, c]
bob      [a]
tim      [c, d]
alice    [a, d]

Desired output:

val     users
a      [alice,bob]
b      [alice]
c      [alice,tim]
d      [alice,tim]

Any ideas?

cs95 · Accepted Answer · 2018-03-12 08:11:09Z

4

Step 1
Reshape your data so that each row holds a single (val, user) pair:

from itertools import chain

df = pd.DataFrame({
    'val' : list(chain.from_iterable(df.val.tolist())), 
    'user' : df.user.repeat(df.val.str.len())
})

Step 2
Use groupby + apply + unique:

df.groupby('val').user.apply(lambda x: x.unique().tolist())

val
a    [alice, bob]
b         [alice]
c    [alice, tim]
d    [tim, alice]
Name: user, dtype: object
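
For readers on pandas 0.25 or later, a possible shortcut for Step 1 is `DataFrame.explode`, which did not exist when this answer was written. A minimal sketch, assuming the question's original `df` (the name `long_df` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'user': ['alice', 'bob', 'tim', 'alice'],
    'val': [['a', 'b', 'c'], ['a'], ['c', 'd'], ['a', 'd']],
})

# explode() gives each list element its own row and repeats 'user' alongside it.
long_df = df.explode('val')

# Collect the distinct users per value, in order of first appearance.
result = long_df.groupby('val')['user'].unique().apply(list)
print(result)
```

This should give the same groups as Step 2 above, with the same first-appearance ordering inside each list.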

edited Mar 12, 2018 at 8:11

answered Mar 12, 2018 at 7:40

cs95


3 Comments

pe-perry Over a year ago

It is not the same as OP's desired output.

pe-perry Over a year ago

Shouldn't rows 'c' and 'd' be [1, 3] (users 1 and 3 have values 'c' and 'd')? Your code gives [1, 1].

qrs Over a year ago

I want to show the actual users. One second, let me update my output. Showing the users as numbers is confusing; that's my fault.

pe-perry · Accepted Answer · 2018-03-12 08:09:52Z

1

This is my approach.

df2 = (df
       .set_index('user')
       .val
       .apply(pd.Series)
       .stack()
       .reset_index(name='val')  # Reshape the data
       .groupby(['val'])
       .user
       .apply(lambda x: sorted(set(x))))  # Show users that use the value

Output:

print(df2)
# val
# a    [alice, bob]
# b         [alice]
# c    [alice, tim]
# d    [alice, tim]
# Name: user, dtype: object

answered Mar 12, 2018 at 8:09

pe-perry


3 Comments

cs95 Over a year ago

@qrs If performance is important, you may want to take another look at the other answers

pe-perry Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Would you mind telling us why your code is faster?

cs95 Over a year ago

Sure, no worries. apply(pd.Series) is generally considered very slow. I learned this the hard way :)
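
One way to check this on your own machine is a small timing comparison. The sketch below is only an illustration: the benchmark data and the function names `reshape_with_apply` and `reshape_with_chain` are made up for this example, and the numbers will vary with pandas version and hardware.

```python
import timeit
from itertools import chain

import pandas as pd

# Hypothetical benchmark data: 1,000 rows, three values per row.
df = pd.DataFrame({
    'user': ['u%d' % i for i in range(1000)],
    'val': [['a', 'b', 'c']] * 1000,
})

def reshape_with_apply(frame):
    # Builds one pd.Series per row before stacking.
    return (frame.set_index('user').val
                 .apply(pd.Series)
                 .stack()
                 .reset_index(name='val')[['val', 'user']])

def reshape_with_chain(frame):
    # Flattens the lists once and repeats each user by its list length.
    return pd.DataFrame({
        'val': list(chain.from_iterable(frame.val.tolist())),
        'user': frame.user.repeat(frame.val.str.len()).tolist(),
    })

t_apply = timeit.timeit(lambda: reshape_with_apply(df), number=3)
t_chain = timeit.timeit(lambda: reshape_with_chain(df), number=3)
print('apply(pd.Series): %.3fs, chain + repeat: %.3fs' % (t_apply, t_chain))
```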

jezrael · Accepted Answer · 2018-03-12 08:23:54Z

1

I think you need:

df2 = (pd.DataFrame(df['val'].values.tolist(), index=df['user'].values)
         .stack()
         .reset_index(name='val')
         .groupby('val')['level_0']
         .unique()
         .reset_index()
         .rename(columns={'level_0':'user'})
     )
print(df2)
  val          user
0   a  [alice, bob]
1   b       [alice]
2   c  [alice, tim]
3   d  [tim, alice]
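
Note that row `d` comes out as `[tim, alice]` rather than `[alice, tim]` because `unique()` keeps values in order of first appearance. If the alphabetical order from the desired output matters, sorting each group is one option. A minimal sketch, assuming the data has already been reshaped to one (val, user) pair per row (the `long_df` frame here is written out by hand for illustration):

```python
import pandas as pd

# Long-format data in the same row order as the reshaped question data.
long_df = pd.DataFrame({
    'val':  ['a', 'b', 'c', 'a', 'c', 'd', 'a', 'd'],
    'user': ['alice', 'alice', 'alice', 'bob', 'tim', 'tim', 'alice', 'alice'],
})

# unique() keeps first-appearance order, so 'd' lists tim before alice.
by_appearance = long_df.groupby('val')['user'].unique().apply(list)

# Sorting each group's distinct users gives alphabetical order instead.
alphabetical = long_df.groupby('val')['user'].apply(lambda s: sorted(s.unique()))

print(by_appearance['d'])   # ['tim', 'alice']
print(alphabetical['d'])    # ['alice', 'tim']
```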

edited Mar 12, 2018 at 8:23

answered Mar 12, 2018 at 8:10

jezrael


David L · Accepted Answer · 2018-03-12 07:40:59Z

0

I don't have enough reputation to write this as a comment, but this question has the answer: "How to print dataframe without index".

Basically, change the last line to:

print(df2.to_string(index=False))

answered Mar 12, 2018 at 7:40

David L


1 Comment

cs95 Over a year ago

No, that isn't it.
