How to drop duplicates in pandas dataframe but keep row based on specific column value

Question

I have a pandas dataframe with NBA player stats, and I want to drop the rows of duplicate players. The duplicates exist because some players played on multiple teams during the 2020-2021 season. For each of these players there is also a row with the player's combined stats across all teams and a team label of 'TOT', which indicates that the player played on two or more teams that season. When I drop duplicate players, I want the 'TOT' row to remain and all the other duplicates to be removed. I'm unsure how to specify that I want to drop all duplicates but keep the row where df['Tm'] == 'TOT'.

Here is what my dataframe looks like: Dataframe

In this example, I want to drop the duplicates of the player 'Jarrett Allen', but keep the row for Jarrett Allen where his team (Tm) is 'TOT'.

Please edit your question so all the required info is in the question itself, not in attached images. The question should be phrased as an MRE.
– noah, Commented Feb 1, 2021 at 19:38

Arkadiusz · Accepted Answer · 2021-02-01 19:55:58Z


You can just filter out unnecessary rows:

df = df.loc[~df['Rk'].duplicated(keep=False) | (df['Tm'] == 'TOT'), :]

It can be understood this way: From my dataframe take all rows which are not duplicated in column 'Rk' or rows which have 'TOT' in column 'Tm'.

":" at the end means that you want to take all columns.
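To see the mask at work, here is a minimal sketch on made-up data. The 'Rk' and 'Tm' column names come from the answer above; the 'Player' column, the placeholder names and the team codes are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample: the player with Rk 2 has a combined 'TOT' row
# plus one row per team; the other two players appear once.
df = pd.DataFrame({
    'Rk': [1, 2, 2, 2, 3],
    'Player': ['Player A', 'Jarrett Allen', 'Jarrett Allen',
               'Jarrett Allen', 'Player B'],
    'Tm': ['AAA', 'TOT', 'BBB', 'CCC', 'DDD'],
})

# True for every row whose 'Rk' value occurs more than once
is_dup = df['Rk'].duplicated(keep=False)

# Keep rows that are not duplicated, or that are the combined 'TOT' row
result = df.loc[~is_dup | (df['Tm'] == 'TOT'), :]
print(result)
```

Splitting the condition into a named mask is only for readability; it should behave the same as the one-liner.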


1 Comment

Code Pope Over a year ago

Very clever approach.

mullinscr · Accepted Answer · 2021-02-01 19:49:50Z

One way is to use a helper column. For example with the following df,

    player  stats team
0      bob      1  ABC
1    alice      2  DEF
2  charlie      3  GHI
3     mary      4  JKL
4     mary      5  MNO
5     mary      6  TOT
6      bob      7  TOT
7      bob      8  VWX

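Creating a column whose value is True if the 'team' value is 'TOT' and False otherwise results in the following (shown here sorted by player):

import numpy as np

# True for the combined 'TOT' row, False for every per-team row
df['multiple_teams'] = np.where(df['team'] == 'TOT', True, False)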

    player  stats team  multiple_teams
1    alice      2  DEF           False
0      bob      1  ABC           False
6      bob      7  TOT            True
7      bob      8  VWX           False
2  charlie      3  GHI           False
3     mary      4  JKL           False
4     mary      5  MNO           False
5     mary      6  TOT            True

Now we can use the keep parameter of drop_duplicates() to decide what to keep. Dropping duplicates on the subset of player and multiple_teams with keep=False removes every row that is duplicated across both columns, which results in:

    player  stats team  multiple_teams
1    alice      2  DEF           False
6      bob      7  TOT            True
2  charlie      3  GHI           False
5     mary      6  TOT            True
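The answer describes the drop_duplicates() call without showing it. A minimal sketch of that step, rebuilding the sample df from the table above, might look like this (the `result` variable name and the use of `.eq('TOT')` to build the helper column are my own choices):

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    'player': ['bob', 'alice', 'charlie', 'mary', 'mary', 'mary', 'bob', 'bob'],
    'stats': [1, 2, 3, 4, 5, 6, 7, 8],
    'team': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'TOT', 'TOT', 'VWX'],
})

# Helper column: True for the combined 'TOT' row, False otherwise
df['multiple_teams'] = df['team'].eq('TOT')

# keep=False drops every row whose (player, multiple_teams) pair repeats,
# which removes the per-team rows of multi-team players
result = df.drop_duplicates(subset=['player', 'multiple_teams'], keep=False)
print(result.sort_values('player'))
```

One caveat: this relies on every 'TOT' player having at least two per-team rows. A player with exactly one team row and one 'TOT' row would keep both, because neither pair is duplicated.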

ALollz · Accepted Answer · 2021-02-01 20:01:26Z

You can sort the DataFrame with the key argument of sort_values (available in pandas 1.1.0 and later) so that 'TOT' rows are sorted to the bottom, then call drop_duplicates, keeping the last row for each player.

This guarantees that in the end there is only a single row per player, even if the data are messy and may have multiple 'TOT' rows for a single player, one team and one 'TOT' row, or multiple teams and multiple 'TOT' rows.

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

For example, with this sample data:

print(df)
#    player  stats team
#0    alice      2  DEF
#1      bob      7  TOT
#2      bob      1  ABC
#3  charlie      3  GHI
#4     mary      4  JKL
#5     mary      5  MNO
#6     mary      6  TOT

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

print(df)
#    player  stats team
#0    alice      2  DEF
#3  charlie      3  GHI
#1      bob      7  TOT
#6     mary      6  TOT
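The messy-data claim can be checked with a small sketch. The data below is made up for illustration, and `kind='mergesort'` is my addition, not part of the answer: it requests a stable sort so that the order among tied rows is deterministic.

```python
import pandas as pd

# Hypothetical messy data: 'dana' has two 'TOT' rows and one team row,
# 'eli' has a single team row plus 'TOT', 'fay' played for one team only.
df = pd.DataFrame({
    'player': ['dana', 'dana', 'dana', 'eli', 'eli', 'fay'],
    'stats': [1, 2, 3, 4, 5, 6],
    'team': ['TOT', 'ABC', 'TOT', 'TOT', 'DEF', 'GHI'],
})

# The key maps 'team' to booleans; False sorts before True, so 'TOT' rows
# end up last and keep='last' picks a 'TOT' row whenever one exists.
result = (df.sort_values('team', key=lambda x: x.eq('TOT'), kind='mergesort')
            .drop_duplicates('player', keep='last'))
print(result)
```

Each player ends up with exactly one row. When a player has several 'TOT' rows, the stable sort means the one that appears last in the original data is kept.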
