I'm aiming to drop duplicate values in a df, where a row counts as a duplicate when its values in two separate columns both match another row. Using the example below, I want to drop rows where Value and Item are duplicated. However, within each set of duplicates I want to keep the row where df['Group1'] == df['Group'].

Note: df = df.drop_duplicates(['Value', 'Item']) will not always be ideal, as the result depends on the ordering in Group. For instance, for the duplicates found at Item 80.0 and 260.0 the first row should be kept, but for Item 300.0 the second row should be kept. I don't want to sort on Group either, as the strings could change. For example, the groups could be Blue and Green, which would change the sort order and therefore the output.

import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red', 'Red'],
    'Group2': ['Green', 'Green', 'Green', 'Green', 'Green', 'Green', 'Green', 'Green', 'Green'],
})

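My attempt below filters on Group1 == Group first, but that discards rows such as Z 210.0 and X 310.0 altogether, because their Group never matches Group1:

df = df[df['Group1'] == df['Group']].drop_duplicates(subset=['Item', 'Value'])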

If I perform df = df.drop_duplicates(['Value', 'Item']), the output is:

  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
2     Y  200.0    Red    Red  Green
3     Z  210.0  Green    Red  Green
4     D  260.0    Red    Red  Green
6     E  300.0  Green    Red  Green # incorrect
8     X  310.0  Green    Red  Green

Intended output:

  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
1     Y  200.0    Red    Red  Green
2     Z  210.0  Green    Red  Green
3     D  260.0    Red    Red  Green
4     E  300.0    Red    Red  Green
5     X  310.0  Green    Red  Green

Comments
  • I don't think your explanation is clear enough; besides, df.drop_duplicates(['Value', 'Item']) seems to get your expected output. Commented Mar 16, 2021 at 0:45
  • @sammywemmy, more detail has been added. Commented Mar 16, 2021 at 1:05
  • @Chopin Does this mean you need to keep records with priority Group == Group1 first, and then Group != Group1? Commented Mar 16, 2021 at 5:45
  • Looks like you are expecting Red as the first record. Your dataset has Green as the first record. If you want to swap them, sort the data in descending order. Then you will get Red first and Green second for 'Group'. That will solve your problem. Alternatively, provide additional details so we can help you. Commented Mar 16, 2021 at 6:35
  • @JoeFerndz, I've added a note in the question explaining why this may not work long term. I've used dummy data and the strings in Group will change. So sorting won't be ideal, as I'd need to keep switching between ascending and descending. Commented Mar 16, 2021 at 22:07

2 Answers

1
df1 = df.drop_duplicates(subset=['Item', 'Value'])
df2 = df[df['Group'] == df['Group1']]
  • df1 drops the rows that are duplicated on the columns Item and Value.
  • df2 keeps the rows where the values in column Group and column Group1 are the same.

This follows from your requirement: "I want to keep the row where df['Group1'] == df['Group']."

The one thing left to do is to replace the values in df1 with the values from df2 wherever both the Item and Value columns match.

pandas.DataFrame.update() modifies a DataFrame in place using non-NA values from another DataFrame, aligning the two on their index. You can use it like this:

df1.set_index(['Value', 'Item'], inplace=True)
df1.update(df2.set_index(['Value', 'Item']))
df1.reset_index(inplace=True) # to recover the initial structure
# print(df1)
  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
1     Y  200.0    Red    Red  Green
2     Z  210.0  Green    Red  Green
3     D  260.0    Red    Red  Green
4     E  300.0    Red    Red  Green
5     X  310.0  Green    Red  Green

Besides update, you can use the index of df2 to select the matching rows of df1 with .loc and then assign.

df1.set_index(['Value', 'Item'], inplace=True)
df2.set_index(['Value', 'Item'], inplace=True)

df1.loc[df2.index] = df2

df1.reset_index(inplace=True)
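
Put together as one script, the update approach might look like the sketch below. It chains set_index instead of using inplace=True, and it also deduplicates df2 on the key columns. That guard is my own assumption rather than part of the code above: it is meant for data where more than one row per Value/Item pair has Group == Group1, since update aligns on the index and duplicate labels cannot be aligned unambiguously.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

keys = ['Value', 'Item']

# One row per (Value, Item) pair; initially whichever row comes first.
df1 = df.drop_duplicates(subset=keys).set_index(keys)

# Preferred rows: Group matches Group1.
# The drop_duplicates here is a precaution (an assumption, see above).
df2 = (
    df[df['Group'] == df['Group1']]
    .drop_duplicates(subset=keys)
    .set_index(keys)
)

# Overwrite rows of df1 with the preferred rows wherever the keys match.
df1.update(df2)

# Restore Value and Item as ordinary columns.
result = df1.reset_index()
print(result)
```

For the sample data, the only row that changes is E / 300.0, whose Group switches from Green to Red.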

1
df = (
    pd.concat([df[df.Group == df.Group1], df[df.Group != df.Group1]])
    .drop_duplicates(subset=['Item', 'Value'])
    .sort_index()
)

Output

  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
2     Y  200.0    Red    Red  Green
3     Z  210.0  Green    Red  Green
4     D  260.0    Red    Red  Green
7     E  300.0    Red    Red  Green
8     X  310.0  Green    Red  Green
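
To make the priority explicit, the same idea can be written with a boolean helper column and a stable sort, which never sorts on the Group strings themselves. This is only a sketch: the column name _mismatch is made up for illustration, and the final reset_index(drop=True) is an addition to reproduce the 0 to 5 index shown in the question's intended output.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

result = (
    # False where Group == Group1, True otherwise (hypothetical helper column).
    df.assign(_mismatch=df['Group'].ne(df['Group1']))
    # False sorts before True, so matching rows come first;
    # mergesort is stable, so ties keep their original order.
    .sort_values('_mismatch', kind='mergesort')
    # Keep the first row per (Item, Value), i.e. a matching row if one exists.
    .drop_duplicates(subset=['Item', 'Value'])
    # Restore the original row order, then tidy up.
    .sort_index()
    .drop(columns='_mismatch')
    .reset_index(drop=True)
)
print(result)
```

Because the sort key is a boolean derived from the comparison, renaming the groups (for example to Blue and Green) should not change which rows are kept.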
