0

I have a pandas dataframe of NBA player stats, and I want to drop the rows for duplicate players. The duplicates exist because some players played for multiple teams in the 2020-2021 season. For each of these players there is also a row with the player's combined stats across all teams, labelled with the team 'TOT' to show that the player played for two or more teams that season. When I drop the duplicates, I want the 'TOT' row to remain and all of that player's other rows to be gone. I'm unsure how to specify that I want to drop all duplicates but keep the one where df['Tm'] == 'TOT'.

Here is what my dataframe looks like: (image: Dataframe)

In this example, I want to drop the duplicates of the player 'Jarrett Allen', but keep the row for Jarrett Allen where his team (Tm) is 'TOT'.

1
  • 1
    Please edit your question so all the required info is in the question itself, not in attached images. The question should be phrased as a MRE Commented Feb 1, 2021 at 19:38

3 Answers

2

You can just filter out unnecessary rows:

df = df.loc[~df['Rk'].duplicated(keep=False) | (df['Tm'] == 'TOT'), :]

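It can be understood this way: from my dataframe, take all rows whose 'Rk' value is not duplicated, plus all rows that have 'TOT' in column 'Tm'. This assumes that all of a player's rows share the same 'Rk' value; if they do not, use the player-name column instead.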
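To see the filter in action, here is a minimal sketch on a small made-up frame. The 'Rk', 'Player' and 'Tm' column names follow the question; the placeholder players, team codes and rank values are assumptions for illustration only, as are the names `is_repeated`, `is_total` and `result`.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({
    'Rk': [1, 2, 2, 2, 3],
    'Player': ['Player A', 'Jarrett Allen', 'Jarrett Allen',
               'Jarrett Allen', 'Player B'],
    'Tm': ['AAA', 'TOT', 'BBB', 'CCC', 'DDD'],
})

# duplicated(keep=False) marks every row whose 'Rk' occurs more than once.
is_repeated = df['Rk'].duplicated(keep=False)
# Rows holding the combined stats across teams.
is_total = df['Tm'] == 'TOT'

# Keep rows that are not repeated, plus the 'TOT' rows; ':' selects all columns.
result = df.loc[~is_repeated | is_total, :]
print(result)
```

Only one row per player should remain here: the single-team rows for the two placeholder players and the 'TOT' row for Jarrett Allen.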

":" at the end means that you want to take all columns.


1 Comment

Very clever approach.
0

One way is to use a helper column. For example with the following df,

    player  stats team
0      bob      1  ABC
1    alice      2  DEF
2  charlie      3  GHI
3     mary      4  JKL
4     mary      5  MNO
5     mary      6  TOT
6      bob      7  TOT
7      bob      8  VWX

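Creating a column where the value is True if the 'team' value is 'TOT' and False otherwise, then sorting by player so each player's rows sit together, results in:

import numpy as np

# Flag the combined-stats rows: True where team is 'TOT', False otherwise
df['multiple_teams'] = np.where(df['team'] == 'TOT', True, False)
# Sort by player so each player's rows are adjacent
df = df.sort_values('player')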

    player  stats team  multiple_teams
1    alice      2  DEF           False
0      bob      1  ABC           False
6      bob      7  TOT            True
7      bob      8  VWX           False
2  charlie      3  GHI           False
3     mary      4  JKL           False
4     mary      5  MNO           False
5     mary      6  TOT            True

Now we can use the keep parameter of the drop_duplicates() function to decide what to keep. In this case we can achieve the desired result by dropping duplicates on the subset ['player', 'multiple_teams'] with keep=False, which removes every row whose (player, multiple_teams) pair occurs more than once. That eliminates the individual team rows of any player who has two or more of them, while each 'TOT' row and each single-team player's row is unique and survives. The result is:

    player  stats team  multiple_teams
1    alice      2  DEF           False
6      bob      7  TOT            True
2  charlie      3  GHI           False
5     mary      6  TOT            True
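The answer describes the drop step without showing it. A minimal sketch of the whole approach, rebuilding the example frame from the top of this answer, might look like the following. The name `deduped` is an assumption, and the final sort is only there to match the ordering of the table above.

```python
import pandas as pd

# The example frame from the start of this answer.
df = pd.DataFrame({
    'player': ['bob', 'alice', 'charlie', 'mary', 'mary', 'mary', 'bob', 'bob'],
    'stats': [1, 2, 3, 4, 5, 6, 7, 8],
    'team': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'TOT', 'TOT', 'VWX'],
})

# Helper column: True for the combined-stats rows, False otherwise.
df['multiple_teams'] = df['team'] == 'TOT'

# keep=False drops every row whose (player, multiple_teams) pair is repeated,
# i.e. the individual team rows of players with two or more of them.
deduped = df.drop_duplicates(subset=['player', 'multiple_teams'], keep=False)

# Sort by player for display.
deduped = deduped.sort_values('player')
print(deduped)
```

One caveat of this sketch: it relies on each multi-team player having at least two team rows and exactly one 'TOT' row. A player with a single team row plus a 'TOT' row would keep both, and a player with two 'TOT' rows would lose both.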


0

You can sort the DataFrame using the key argument of sort_values (available since pandas 1.1.0), such that the 'TOT' rows are sorted to the bottom, and then call drop_duplicates with keep='last'.

This guarantees that in the end there is only a single row per player, even if the data are messy and may have multiple 'TOT' rows for a single player, one team and one 'TOT' row, or multiple teams and multiple 'TOT' rows.

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

For example, starting from this DataFrame:

print(df)
#    player  stats team
#0    alice      2  DEF
#1      bob      7  TOT
#2      bob      1  ABC
#3  charlie      3  GHI
#4     mary      4  JKL
#5     mary      5  MNO
#6     mary      6  TOT

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

print(df)
#    player  stats team
#0    alice      2  DEF
#3  charlie      3  GHI
#1      bob      7  TOT
#6     mary      6  TOT
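The claim about messy data can be checked with a small sketch. The frame below is made up for illustration: 'bob' has two 'TOT' rows and 'mary' has one team row plus a 'TOT' row. The names `messy` and `clean` are assumptions, and `kind='stable'` is added here only to make the row order deterministic; it is not part of the answer's code.

```python
import pandas as pd

# Made-up messy data, not taken from the question.
messy = pd.DataFrame({
    'player': ['alice', 'bob', 'bob', 'bob', 'bob', 'mary', 'mary'],
    'stats': [2, 1, 8, 9, 9, 4, 4],
    'team': ['DEF', 'ABC', 'VWX', 'TOT', 'TOT', 'JKL', 'TOT'],
})

# The key maps 'team' to booleans, so 'TOT' rows (True) sort after the rest;
# keep='last' then retains the final row per player, a 'TOT' row if one exists.
clean = (messy.sort_values('team', key=lambda s: s.eq('TOT'), kind='stable')
              .drop_duplicates('player', keep='last'))
print(clean)
```

Each player should appear exactly once in `clean`, with the 'TOT' row winning wherever one exists.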
