
Given the following dataframe:

import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

I am trying to get the duplicates in the month column:

df[df.duplicated(subset=['month'])]

By default, keep="first", but this returns two rows for month 2:

   month  year  sale  title
1      2  2017    45  Twoes
3      1  2020    20   Four
4      2  2018    28   Five

I'm confused with the output. Am I missing something here?

  • The output is the rows flagged as duplicates in your dataframe, not the rows left after dropping the duplicates. Commented Jul 6, 2021 at 7:13

2 Answers


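With keep="first", the output contains every duplicated row except the first occurrence of each value. Month 2 appears three times in the data, so two of its rows are returned.

If you need only the first occurrence of each duplicated value, invert that mask and combine it with a second mask built with keep=False, which flags every row whose value is repeated: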

df1 = df[~df.duplicated(subset=['month']) & df.duplicated(subset=['month'], keep=False)]
print(df1)
   month  year  sale  title
0      2  2017    60   Ones
2      1  2020    90  Three
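
To see why the combined mask works, it may help to print the two masks side by side. This is only a sketch on the question's data; the names later, all_dupes, and masks are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

# keep='first' (the default): flags every repeat after the first occurrence
later = df.duplicated(subset=['month'])

# keep=False: flags every row whose month occurs more than once
all_dupes = df.duplicated(subset=['month'], keep=False)

# Hypothetical helper frame, just to compare the masks row by row
masks = pd.DataFrame({'month': df['month'],
                      'later': later,
                      'all_dupes': all_dupes})
print(masks)

# Rows that are duplicated but are NOT a later repeat: the first of each group
first_of_dupes = df[~later & all_dupes]
print(first_of_dupes)
```

Month 10 is False in both masks because it occurs only once, so it never shows up in either result.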


The output is the rows flagged as duplicates in your dataframe, not the rows left after dropping the duplicates. If you want one row per month value instead, use

df.drop_duplicates(subset=['month'])

which gives:

   month  year  sale  title
0      2  2017    60   Ones
2      1  2020    90  Three
5     10  2019    36    Six

You can set keep to 'first', 'last', or False (the boolean, not the string 'None') depending on your requirement.
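
As a rough illustration of how the keep argument changes what drop_duplicates returns, here is a small sketch on the question's data. pandas accepts 'first', 'last', or the boolean False for this argument; the variable names below are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'month': [2, 2, 1, 1, 2, 10],
                   'year': [2017, 2017, 2020, 2020, 2018, 2019],
                   'sale': [60, 45, 90, 20, 28, 36],
                   'title': ['Ones', 'Twoes', 'Three', 'Four', 'Five', 'Six']})

# keep='first' (default): keep the first row of each month
keep_first = df.drop_duplicates(subset=['month'], keep='first')

# keep='last': keep the last row of each month
keep_last = df.drop_duplicates(subset=['month'], keep='last')

# keep=False: drop every row whose month is repeated
keep_none = df.drop_duplicates(subset=['month'], keep=False)

print(keep_first.index.tolist())
print(keep_last.index.tolist())
print(keep_none.index.tolist())
```

With keep=False only the row for month 10 survives, since it is the only month that occurs once.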
