I'm aiming to drop duplicate values in a df. However, I only want to drop rows where the values in two separate columns are both the same. Using the example below, I want to drop rows where Value and Item are duplicated, but keep the row where df['Group1'] == df['Group'].
Note: df = df.drop_duplicates(['Value', 'Item']) will not always be ideal, as the result depends on the ordering in Group. For instance, for the duplicates found in Item 80.0 and 260.0 the first row should be kept, but for Item 300.0 the second row should be kept. I don't want to sort the values here either, as the strings could change. For example, the groups could be Blue and Green, which would alter the intended output.
import pandas as pd

df = pd.DataFrame({
    'Value' : ['X','X','Y','Z','D','D','E','E','X'],
    'Item' : [80.0,80.0,200.0,210.0,260.0,260.0,300.0,300.0,310.0],
    'Group' : ['Red','Green','Red','Green','Red','Green','Green','Red','Green'],
    'Group1' : ['Red','Red','Red','Red','Red','Red','Red','Red','Red'],
    'Group2' : ['Green','Green','Green','Green','Green','Green','Green','Green','Green'],
})

# Attempt: keep only rows where Group1 matches Group, then drop duplicates
df = df[df['Group1'] == df['Group']].drop_duplicates(subset = ['Item','Value'])
If I perform df = df.drop_duplicates(['Value', 'Item']), the output is:
  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
2     Y  200.0    Red    Red  Green
3     Z  210.0  Green    Red  Green
4     D  260.0    Red    Red  Green
6     E  300.0  Green    Red  Green # incorrect
8     X  310.0  Green    Red  Green
Intended output:
  Value   Item  Group Group1 Group2
0     X   80.0    Red    Red  Green
1     Y  200.0    Red    Red  Green
2     Z  210.0  Green    Red  Green
3     D  260.0    Red    Red  Green
4     E  300.0    Red    Red  Green
5     X  310.0  Green    Red  Green
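One possible way to reach the intended output without sorting on the Group strings themselves is to sort on a boolean "does Group match Group1" key and then drop duplicates. This is only a sketch: the helper column `_match` and the result name `out` are made up for illustration, and it assumes that when no row in a duplicated pair matches Group1, keeping any one of them is acceptable.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

out = (
    # _match is a hypothetical helper column: True where Group equals Group1
    df.assign(_match=df['Group'].eq(df['Group1']))
      # put matching rows first; the key is a boolean, not the group name
      .sort_values('_match', ascending=False, kind='stable')
      # keep the first row per (Value, Item), which is now a matching row if one exists
      .drop_duplicates(subset=['Value', 'Item'])
      # restore the original row order, then tidy up
      .sort_index()
      .drop(columns='_match')
      .reset_index(drop=True)
)
print(out)
```

Because the sort key is the comparison result rather than the text in Group, the same code should behave the same way if the groups were, say, Blue and Green.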
df.drop_duplicates(['Value', 'Item']) seems to get your expected output
Group == Group1 and then Group != Group1
Red as first record. Your dataset has Green as first record. If you want to swap, then sort the data in descending order. Then you will get Red first and Green second for 'Group'. That will solve your problem. Alternatively, provide additional details for us to help you.
Group will change. So sorting won't be ideal, as I'll need to continually change it to ascending or descending.
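If sorting is to be avoided altogether, a boolean mask might be enough. The sketch below keeps every row that is either not part of a duplicated (Value, Item) pair or whose Group equals Group1. The names `is_dup`, `matches` and `result` are made up for illustration. It rests on an assumption worth checking against the real data: each duplicated pair contains exactly one row where Group equals Group1. If a pair had no such row, all of its rows would be dropped, and if it had several, all of them would be kept.

```python
import pandas as pd

df = pd.DataFrame({
    'Value': ['X', 'X', 'Y', 'Z', 'D', 'D', 'E', 'E', 'X'],
    'Item': [80.0, 80.0, 200.0, 210.0, 260.0, 260.0, 300.0, 300.0, 310.0],
    'Group': ['Red', 'Green', 'Red', 'Green', 'Red', 'Green', 'Green', 'Red', 'Green'],
    'Group1': ['Red'] * 9,
    'Group2': ['Green'] * 9,
})

# True for every row whose (Value, Item) pair occurs more than once
is_dup = df.duplicated(subset=['Value', 'Item'], keep=False)

# True where the row's Group equals its Group1
matches = df['Group'].eq(df['Group1'])

# Keep unique rows as they are; among duplicates keep only the matching row
result = df[~is_dup | matches].reset_index(drop=True)
print(result)
```

No ascending or descending flag is involved, so nothing needs to change when the group names do.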