
I have a large dataset of football data that I would like to analyse. The dataset contains data from many games, and for every game all of the club's players are listed in a column as a list. I would like to find out how I can get an output like this:

Player  clubs
Tom         3
Car         2
Jon         2
Tex         1

etc.

This is the code I have tried, but I get an error: unhashable type: 'Series'

df = pd.DataFrame({'club': ['Bath', 'Bath', 'Bristol', 'Bristol', 'Bristol', 'Swindon'],
                   'Players': [['Tom', 'Jon', 'Tex'], ['Tom', 'Jon', 'Tex'],
                               ['Car', 'Snow', 'Tom'], ['Car', 'Snow', 'Tom'],
                               ['Car', 'Snow', 'Tom'], ['Tom', 'Car', 'Jon']]})

tr = df.groupby('club')
trt = pd.Series([bg for bgs in tr['Players'] for bg in bgs])
trt.value_counts()
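
(For context: iterating a GroupBy yields (group name, group) pairs, not the values inside each group, so the comprehension collects club-name strings and whole Series objects; value_counts() then fails when it tries to hash a Series. A quick way to see this, as a sketch:

for name, group in df.groupby('club')['Players']:
    print(type(name), type(group))
# <class 'str'> <class 'pandas.core.series.Series'>
)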

3 Answers


Since you have a Series of lists, using explode and drop_duplicates will be slow.

Pure Python should be more efficient here:

from collections import defaultdict

# map each player to the set of clubs they appear in
d = defaultdict(set)
for c, l in zip(df['club'], df['Players']):
    for k in l:
        d[k].add(c)

# count the distinct clubs per player and sort in descending order
out = pd.Series({k: len(s) for k, s in d.items()}).sort_values(ascending=False)

Output:

Tom     3
Jon     2
Car     2
Tex     1
Snow    1
dtype: int64

Comparison on 600K rows:

# explode + groupby
605 ms ± 20.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# explode + value_counts
476 ms ± 17.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pure python
259 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
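
For reference, the timings above could be reproduced along these lines (a sketch; the 600K-row setup here is an assumption, not the answerer's actual benchmark code):

import timeit
from collections import defaultdict

# inflate the 6-row example to 600K rows (assumed setup)
big = pd.concat([df] * 100_000, ignore_index=True)

def pure_python(frame):
    d = defaultdict(set)
    for c, l in zip(frame['club'], frame['Players']):
        for k in l:
            d[k].add(c)
    return pd.Series({k: len(s) for k, s in d.items()})

def explode_value_counts(frame):
    return (frame.explode('Players')
                 .drop_duplicates(['Players', 'club'])
                 .value_counts('Players'))

# mean time per run over 7 runs
print(timeit.timeit(lambda: pure_python(big), number=7) / 7)
print(timeit.timeit(lambda: explode_value_counts(big), number=7) / 7)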

2 Comments

You are right. Standard python is more efficient!!! (but it's not very elegant :P)
@Corralien Define elegant? I think it is as elegant as the pandas solution ;)

You can try exploding the Players column, then counting the unique values of the club column in each group:

df = (df.explode('Players')
      .groupby('Players')['club'].nunique()
      .to_frame('clubs').reset_index())

# or: drop duplicate (player, club) pairs first, then count

df = (df.explode('Players').drop_duplicates(['Players', 'club'])
      .groupby('Players')['club'].count()
      .to_frame('clubs').reset_index())

# or: collect the unique clubs per player, then take the length

df = (df.explode('Players')
      .groupby('Players')['club'].unique().apply(len)
      .to_frame('clubs').reset_index())
print(df)

  Players  clubs
0     Car      2
1     Jon      2
2    Snow      1
3     Tex      1
4     Tom      3

2 Comments

How could I order this by the number of clubs?
@sqlbik Just add .sort_values('clubs', ascending=False).
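
Put together, that suggestion looks like this (a sketch using the nunique variant above):

out = (df.explode('Players')
       .groupby('Players')['club'].nunique()
       .sort_values(ascending=False)
       .to_frame('clubs').reset_index())

which puts Tom (3 clubs) at the top.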

You can use value_counts instead of groupby after exploding your dataframe:

out = (df.explode('Players').drop_duplicates(['Players', 'club'])
         .value_counts('Players').rename('clubs').reset_index())
print(out)

# Output
  Players  clubs
0     Tom      3
1     Car      2
2     Jon      2
3    Snow      1
4     Tex      1
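
Note that value_counts already sorts in descending order, so this matches the order asked for in the question. If the players should be the index rather than a column, a small follow-up could be (a sketch):

print(out.set_index('Players').rename_axis('Player'))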

