
I have a large dataset of football data that I would like to analyse. The dataset contains data from many games, and for every game all of the club's players are listed in a column as a list. I would like to find out how I can get an output like this:

Player  clubs
Tom         3
Car         2
Jon         2
Tex         1

etc.

This is the code I have tried, but I get an error: unhashable type: 'Series'

df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})

tr = df.groupby('club')
trt = pd.Series([bg for bgs in tr['Players'] for bg in bgs])
trt.value_counts()
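
(For context: iterating a GroupBy yields (group name, group) pairs, not the values inside each group, so the comprehension collects club-name strings and whole Series objects; value_counts() then fails when it tries to hash a Series. A quick way to see this, as a sketch:

for name, group in df.groupby('club')['Players']:
    print(type(name), type(group))
# <class 'str'> <class 'pandas.core.series.Series'>
)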

3 Answers


Since you have a Series of lists, using explode and drop_duplicates will be slow.

Pure Python should be more efficient here:

from collections import defaultdict

# map each player to the set of clubs they appear in
d = defaultdict(set)
for c, l in zip(df['club'], df['Players']):
    for k in l:
        d[k].add(c)

# count the distinct clubs per player and sort in descending order
out = pd.Series({k: len(s) for k, s in d.items()}).sort_values(ascending=False)

Output:

Tom     3
Jon     2
Car     2
Tex     1
Snow    1
dtype: int64

Comparison on 600K rows:

# explode + groupby
605 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# explode + value_counts
476 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pure python
259 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
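
For reference, the timings above could be reproduced along these lines (a sketch; the 600K-row setup here is an assumption, not the answerer's actual benchmark code):

import timeit
from collections import defaultdict

# inflate the 6-row example to 600K rows (assumed setup)
big = pd.concat([df] * 100_000, ignore_index=True)

def pure_python(frame):
    d = defaultdict(set)
    for c, l in zip(frame['club'], frame['Players']):
        for k in l:
            d[k].add(c)
    return pd.Series({k: len(s) for k, s in d.items()})

def explode_value_counts(frame):
    return (frame.explode('Players')
                 .drop_duplicates(['Players', 'club'])
                 .value_counts('Players'))

# mean time per run over 7 runs
print(timeit.timeit(lambda: pure_python(big), number=7) / 7)
print(timeit.timeit(lambda: explode_value_counts(big), number=7) / 7)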

2 Comments

You are right. Standard python is more efficient!!! (but it's not very elegant :P)
@Corralien Define elegant? I think it is as elegant as the pandas solution ;)

You can try exploding the Players column, then counting the unique values of the club column in each group:

df = (df.explode('Players')
      .groupby('Players')['club'].nunique()
      .to_frame('clubs').reset_index())

# or: drop duplicate (player, club) pairs first, then count

df = (df.explode('Players').drop_duplicates(['Players', 'club'])
      .groupby('Players')['club'].count()
      .to_frame('clubs').reset_index())

# or: collect the unique clubs per player, then take the length

df = (df.explode('Players')
      .groupby('Players')['club'].unique().apply(len)
      .to_frame('clubs').reset_index())
print(df)

  Players  clubs
0     Car      2
1     Jon      2
2    Snow      1
3     Tex      1
4     Tom      3

2 Comments

How could I order this by the number of clubs?
@sqlbik Just add .sort_values('clubs', ascending=False).
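
Put together, that suggestion looks like this (a sketch using the nunique variant above):

out = (df.explode('Players')
       .groupby('Players')['club'].nunique()
       .sort_values(ascending=False)
       .to_frame('clubs').reset_index())

which puts Tom (3 clubs) at the top.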

You can use value_counts instead of groupby after exploding your dataframe:

out = (df.explode('Players').drop_duplicates(['Players', 'club'])
         .value_counts('Players').rename('clubs').reset_index())
print(out)

# Output
  Players  clubs
0     Tom      3
1     Car      2
2     Jon      2
3    Snow      1
4     Tex      1
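
Note that value_counts already sorts in descending order, so this matches the order asked for in the question. If the players should be the index rather than a column, a small follow-up could be (a sketch):

print(out.set_index('Players').rename_axis('Player'))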

