Drop duplicates where two columns have same values - pandas

Question

I'm aiming to drop duplicate rows in a df, where rows count as duplicates when both Value and Item are the same. Among each set of duplicates, I want to keep the row where df['Group1'] == df['Group'].

Note: df = df.drop_duplicates(['Value', 'Item']) will not always be ideal as it depends on the ordering in Group. For instance, for the duplicates found in Item 80.0 and 260.0 the first row should be kept, but for Item 300.0 the second row should be kept. I don't want to sort values here either, as the strings could change. For example, the groups could be Blue and Green, which would alter the intended output.

import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red'],
    'Group2': ['Green', 'Green', 'Green', 'Green', 'Green', 'Green', 'Green', 'Green', 'Green'],
})

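# Attempt: filtering before dropping duplicates also removes Z/210.0 and X/310.0, whose only rows have Group != Group1
df = df[df['Group1'] == df['Group']].drop_duplicates(subset=['Item', 'Value'])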

If I perform df = df.drop_duplicates(['Value', 'Item']), the output is:

  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
2     Y  200.0    Red    Red  Green
3     Z  210.0  Green    Red  Green
4     D  260.0    Red    Red  Green
6     E  300.0  Green    Red  Green # incorrect
8     X  310.0  Green    Red  Green

intended output:

  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
1     Y  200.0    Red    Red  Green
2     Z  210.0  Green    Red  Green
3     D  260.0    Red    Red  Green
4     E  300.0    Red    Red  Green
5     X  310.0  Green    Red  Green
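
To make the ordering dependence concrete, here is a minimal sketch, assuming the sample df above, that compares keep='first' with keep='last'. The names keys and intended are introduced only for this illustration. Neither setting reproduces the intended Group column, because the row to keep is the first one for some duplicates and the second one for others.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

keys = ['Value', 'Item']  # columns that define a duplicate
intended = ['Red', 'Red', 'Green', 'Red', 'Red', 'Green']  # Group column of the intended output

# Group values that survive when keeping the first or the last row of each duplicate set
first = df.drop_duplicates(keys, keep='first')['Group'].tolist()
last = df.drop_duplicates(keys, keep='last')['Group'].tolist()

print(first)  # ['Red', 'Red', 'Green', 'Red', 'Green', 'Green']
print(last)   # ['Green', 'Red', 'Green', 'Green', 'Red', 'Green']
print(first == intended, last == intended)  # False False
```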

I don't think your explanation is clear enough; besides, df.drop_duplicates(['Value', 'Item']) seems to get your expected output.
– sammywemmy, Commented Mar 16, 2021 at 0:45
@Chopin Does it mean that you need to keep a record with priority Group==Group1 and then Group!=Group1?
– Deven Ramani, Commented Mar 16, 2021 at 5:45
Looks like you are expecting Red as the first record. Your dataset has Green as the first record. If you want to swap, then sort the data in descending order. Then you will get Red first and Green second for 'Group'. That will solve your problem. Alternatively, provide additional details for us to help you.
– Joe Ferndz, Commented Mar 16, 2021 at 6:35
@JoeFerndz, I've added a note in the question explaining why this may not work long term. I've used dummy data and the strings in Group will change. So sorting won't be ideal as I'll need to continually change it to ascending or descending.
– Chopin, Commented Mar 16, 2021 at 22:07

Ynjxsjmh · Accepted Answer · 2021-03-17 04:04:27Z

df1 = df.drop_duplicates(subset = ['Item','Value'])
df2 = df[df['Group'] == df['Group1']]

Dataframe df1 drops the duplicate rows on columns Item and Value, keeping the first row of each set.
Dataframe df2 keeps the rows where column Group and column Group1 have the same value.

From the question: "I want to keep the row where ['Group1'] == df['Group']."

The one thing left to do is to replace the values in dataframe df1 with the values from dataframe df2 wherever both their Item and Value column values are the same.

pandas.DataFrame.update() modifies a DataFrame in place using non-NA values from another DataFrame, aligning the two on their index. You can use it like:

df1.set_index(['Value', 'Item'], inplace=True)
df1.update(df2.set_index(['Value', 'Item']))
df1.reset_index(inplace=True) # to recover the initial structure

# print(df1)
  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
1     Y  200.0    Red    Red  Green
2     Z  210.0  Green    Red  Green
3     D  260.0    Red    Red  Green
4     E  300.0    Red    Red  Green
5     X  310.0  Green    Red  Green
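
The steps above can be collected into one self-contained sketch. The helper name drop_duplicates_prefer_match and its parameters are hypothetical, introduced here only for illustration. It also dedupes the preferred rows first, on the assumption that update() needs unique index labels on the other frame in order to align.

```python
import pandas as pd

def drop_duplicates_prefer_match(df, keys, left, right):
    """Hypothetical helper: dedupe on `keys`, preferring rows where df[left] == df[right]."""
    # First row of every duplicate set, indexed by the key columns
    base = df.drop_duplicates(subset=keys).set_index(keys)
    # Rows where the two columns agree, one per key combination
    preferred = df[df[left] == df[right]].drop_duplicates(subset=keys).set_index(keys)
    # Overwrite base in place with non-NA values from preferred, aligned on the index
    base.update(preferred)
    return base.reset_index()

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

result = drop_duplicates_prefer_match(df, ['Value', 'Item'], 'Group', 'Group1')
print(result)
```

One caveat of this approach: because update() skips NA values, a missing value in a preferred row will not overwrite the value already present in the first row of that duplicate set.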

Besides update, you can use the index of the dataframe df2 to slice df1 and then assign.

df1.set_index(['Value', 'Item'], inplace=True)
df2.set_index(['Value', 'Item'], inplace=True)

df1.loc[df2.index] = df2

df1.reset_index(inplace=True)

Deven Ramani · Accepted Answer · 2021-03-16 06:03:11Z

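# Put rows with Group == Group1 first so drop_duplicates keeps them, then restore the original row order
df = pd.concat([df[df.Group == df.Group1], df[df.Group != df.Group1]]).drop_duplicates(subset=['Item', 'Value']).sort_index()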

Output

  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
2     Y  200.0    Red    Red  Green
3     Z  210.0  Green    Red  Green
4     D  260.0    Red    Red  Green
7     E  300.0    Red    Red  Green
8     X  310.0  Green    Red  Green
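
The same priority idea can also be written with a boolean helper column and a stable sort. This is a sketch of an alternative formulation, not part of the answer above; the column name _mismatch is hypothetical. Sorting on a boolean avoids depending on the alphabetical order of the Group strings, which was the concern raised in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

out = (
    # False where Group equals Group1, so matching rows sort to the top
    df.assign(_mismatch=df['Group'].ne(df['Group1']))
      # Stable sort keeps the original order within the matching and non-matching blocks
      .sort_values('_mismatch', kind='stable')
      # Keeps the first row per key, which is a matching row whenever one exists
      .drop_duplicates(subset=['Item', 'Value'])
      .drop(columns='_mismatch')
      # Restore the original row order
      .sort_index()
)
print(out)
```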

edited Mar 16, 2021 at 6:03

answered Mar 16, 2021 at 5:54
