I have a dataframe with some duplicates that I need to remove. In the dataframe below, where the month, year and type are all the same, only the row with the highest sale should be kept. For example:

import pandas as pd

df = pd.DataFrame({'month': [1, 1, 7, 10],
                   'year': [2012, 2012, 2013, 2014],
                   'type': ['C', 'C', 'S', 'C'],
                   'sale': [55, 40, 84, 31]})

After removing duplicates and keeping the row with the highest value in column 'sale', the result should look like:

df_2 = pd.DataFrame({'month': [1, 7, 10],
                     'year': [2012, 2013, 2014],
                     'type': ['C', 'S', 'C'],
                     'sale': [55, 84, 31]})
  • df.drop_duplicates(subset= ['month', 'year', 'type'], keep= 'first') Commented Mar 2, 2021 at 16:38

2 Answers

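You can sort by sale in descending order, drop duplicates on the key columns (drop_duplicates keeps the first row of each group by default, which is now the highest sale), and then restore the original row order with sort_index: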

(df.sort_values('sale',ascending=False)
   .drop_duplicates(['month','year','type']).sort_index())

   month  year type  sale
0      1  2012    C    55
2      7  2013    S    84
3     10  2014    C    31
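
The same idea can be wrapped in a small helper. This is only a sketch: the function name keep_highest and the extra column 'other' are made up for illustration, and the sample data is reordered so that the highest sale is not already the first row of its group.

```python
import pandas as pd

def keep_highest(df, keys, value):
    """For each combination of `keys`, keep the row with the largest `value`."""
    return (df.sort_values(value, ascending=False)       # highest value first
              .drop_duplicates(subset=keys, keep='first')  # first row per key group survives
              .sort_index())                               # restore original row order

# Hypothetical data: the duplicate with the higher sale is the second row,
# and 'other' is an extra column that should travel with its row.
df = pd.DataFrame({'month': [1, 1, 7, 10],
                   'year': [2012, 2012, 2013, 2014],
                   'type': ['C', 'C', 'S', 'C'],
                   'sale': [40, 55, 84, 31],
                   'other': ['a', 'b', 'c', 'd']})

result = keep_highest(df, ['month', 'year', 'type'], 'sale')
print(result)
```

Because whole rows are kept, any additional columns stay consistent with the surviving sale value.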

You could group by month, year and type and take the max of sale:

df.groupby(['month', 'year', 'type']).max().reset_index()

   month  year type  sale
0      1  2012    C    55
1      7  2013    S    84
2     10  2014    C    31

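If you have another column, say other, then you must specify which column to take the max of. Otherwise max is computed for every non-key column independently, and the values in a result row may come from different original rows. Select the column like this: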

df.groupby(['month', 'year', 'type'])[['sale']].max().reset_index()

   month  year type  sale
0      1  2012    C    55
1      7  2013    S    84
2     10  2014    C    31
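
If the other columns of the winning row should be kept as well, one possible variant (not part of the answer above) is to look up the index of the highest sale per group with idxmax and select those rows with loc. In this sketch the column 'other' and its values are invented for illustration.

```python
import pandas as pd

# Same data as in the question, plus a hypothetical extra column 'other'.
df = pd.DataFrame({'month': [1, 1, 7, 10],
                   'year': [2012, 2012, 2013, 2014],
                   'type': ['C', 'C', 'S', 'C'],
                   'sale': [55, 40, 84, 31],
                   'other': [1, 9, 5, 3]})

keys = ['month', 'year', 'type']

# Plain max() works column by column, so 'sale' and 'other'
# in one result row can originate from different input rows.
mixed = df.groupby(keys).max().reset_index()

# idxmax() gives the index label of the highest sale in each group;
# loc then pulls the complete original rows.
whole_rows = df.loc[df.groupby(keys)['sale'].idxmax()].sort_index()

print(mixed)
print(whole_rows)
```

For the first group, mixed pairs sale 55 with other 9 even though those values sit in different rows of df, while whole_rows keeps sale 55 together with its own other value of 1.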
