df.drop_duplicates python

Question

I'm having difficulty dropping the right duplicates from a dataframe.

I have the following example:

import numpy as np
import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, '', 324, '', '', '', 321],}

df = pd.DataFrame(data=test)

The output can be seen below:

                  date value
0  2012-10-12 10:10:10   123
1  2012-10-12 10:10:10      
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
4  2012-11-02 16:08:07      
5  2012-12-12 23:45:21      
6  2012-12-12 23:45:21   321

My desired output after dropping duplicate dates is as shown below:

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321

However, my attempts to date have been unsuccessful as shown below:

Attempt 1:-

df = df.drop_duplicates(subset='date')

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
5  2012-12-12 23:45:21

Attempt 2:-

df = df.drop_duplicates(subset='date', keep='last')

                  date value
1  2012-10-12 10:10:10      
2  2012-10-19 10:55:10   324
4  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321

Could you help me reach my desired output? Many thanks in advance.

What is the "keep criteria"? I mean, which duplicates should remain in the dataframe: the last occurrence, or does it depend on the value column?
– Pablo C, commented Dec 24, 2020 at 16:28

Shubham Sharma · Accepted Answer · 2020-12-24 16:51:51Z

3

One approach is to mask the empty strings in the column value, then groupby on date and aggregate using first:

df['value'].mask(df['value'].eq('')).groupby(df['date']).first().fillna('').reset_index()
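The one-liner above can be unpacked step by step. This is a minimal sketch using the question's sample data; the intermediate names `masked`, `first_valid` and `result` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
             '2012-11-02 16:08:07', '2012-11-02 16:08:07',
             '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
    'value': [123, '', 324, '', '', '', 321],
})

# Step 1: turn empty strings into missing values (NaN).
masked = df['value'].mask(df['value'].eq(''))

# Step 2: per date, take the first non-missing value (GroupBy.first skips NaN).
first_valid = masked.groupby(df['date']).first()

# Step 3: put the empty-string placeholder back and rebuild a two-column frame.
result = first_valid.fillna('').reset_index()
print(result)
```

Because the result is rebuilt from the group keys, the index is reset to 0..3 and the rows come back sorted by date rather than in their original positions.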

Alternatively, you can mask the empty strings in the column value and assign the result to a temporary column key, then sort the dataframe on the columns date and key (missing values sort last by default), followed by drop_duplicates:

df['key'] = df['value'].mask(df['value'].eq(''))
df.sort_values(['date', 'key']).drop_duplicates('date').drop(columns='key')

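Result (shown for the first approach; the second approach returns the same rows but keeps the original index labels 0, 2, 3, 6):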

                  date value
0  2012-10-12 10:10:10   123
1  2012-10-19 10:55:10   324
2  2012-11-02 16:08:07      
3  2012-12-12 23:45:21   321
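For the sort-based variant, here is a minimal sketch that avoids mutating the original dataframe. It uses `assign` for the temporary column, which is a stylistic choice here and not what the answer itself does; the name `deduped` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
             '2012-11-02 16:08:07', '2012-11-02 16:08:07',
             '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
    'value': [123, '', 324, '', '', '', 321],
})

deduped = (
    # Temporary key: NaN where value is the empty string.
    df.assign(key=df['value'].mask(df['value'].eq('')))
      # Within each date, rows with a real value sort before NaN rows.
      .sort_values(['date', 'key'])
      # Keep the first row per date, i.e. the one with a value if any exists.
      .drop_duplicates('date')
      .drop(columns='key')
)
print(deduped)
```

Unlike the groupby version, this keeps the original index labels (0, 2, 3, 6), which matches the desired output in the question. Call `reset_index(drop=True)` at the end if a fresh 0..3 index is preferred.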

answered Dec 24, 2020 at 16:51


1 Comment

windwalker Over a year ago

The added benefit of having the index reset also covers an issue I forgot to mention. Great stuff.

srishtigarg · Accepted Answer · 2020-12-24 21:30:05Z

1

import numpy as np
import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, np.nan, 324,  np.nan,  np.nan,  np.nan, 321],}

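This should work:

df = pd.DataFrame(data=test)
# Sort by value so that NaN rows go last, then keep the first row per date
df = df.sort_values(by="value")
df = df.drop_duplicates(subset='date')
# Replace the remaining NaN with an empty string
df = df.replace(np.nan, '')
# Restore the original row order (sort_index returns a new frame, so assign it)
df = df.sort_index()
df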

Output comes out like below:

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321
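The same idea can be written as a single chain with the sort options spelled out. This is a sketch, not the answer's exact code: `na_position='last'` is already the pandas default, and `kind='stable'` is added here as an assumption so that tied rows keep their original relative order. The NaN values are left in place; replacing them with `''` afterwards turns the column into object dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
             '2012-11-02 16:08:07', '2012-11-02 16:08:07',
             '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
    'value': [123, np.nan, 324, np.nan, np.nan, np.nan, 321],
})

deduped = (
    # Rows with a value come first, NaN rows are pushed to the end.
    df.sort_values(by='value', na_position='last', kind='stable')
      # The first row seen for each date is kept.
      .drop_duplicates(subset='date')
      # Go back to the original row order.
      .sort_index()
)
print(deduped)
```

Note that sorting on value alone means that, if a date had several non-missing values, the smallest one would be kept.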

edited Dec 24, 2020 at 21:30

answered Dec 24, 2020 at 16:23


3 Comments

windwalker Over a year ago

Thanks Srishti, but the row order seems to be skewed.

srishtigarg Over a year ago

Hey @windwalker, I just added a sorting statement to maintain the order. Please check the edit, hope it helps!

windwalker Over a year ago

The df.sort_index() makes it much cleaner now.

Ismael EL ATIFI · Accepted Answer · 2020-12-24 16:28:58Z

0

import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, '', 324, '', '', '', 321],}

df = pd.DataFrame(data=test)

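# Flag rows whose value is non-empty (bool('') is False)
df["value_not_empty"] = df['value'].map(bool)
# Sort so that rows with a value come after the empty ones
df = df.sort_values("value_not_empty")
df = df.drop(columns=["value_not_empty"])
# Keep the last row per date, i.e. the non-empty one when it exists
df = df.drop_duplicates('date', keep='last')
df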

answered Dec 24, 2020 at 16:28


1 Comment

windwalker Over a year ago

Hi Ismael, as with @Srishti Garg's solution, the row order seems to be skewed, but I am grateful for all the help.
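The ordering issue raised in this comment can probably be handled by sorting on the index after deduplicating. The sketch below is a variant, not Ismael's exact code: it flags rows with `ne('')` instead of `map(bool)` (so that a legitimate value of 0 would not be treated as empty) and uses a stable sort; both choices are assumptions made here.

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10',
             '2012-11-02 16:08:07', '2012-11-02 16:08:07',
             '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
    'value': [123, '', 324, '', '', '', 321],
})

deduped = (
    # has_value is True where value is not the empty string.
    df.assign(has_value=df['value'].ne(''))
      # False sorts before True, so rows with a value end up last.
      .sort_values('has_value', kind='stable')
      # Keep the last row per date: the one with a value when there is one.
      .drop_duplicates('date', keep='last')
      .drop(columns='has_value')
      # Restore the original row order.
      .sort_index()
)
print(deduped)
```

For the date whose rows are all empty, this keeps the last of them (index 4 instead of 3); the row content is identical to the desired output.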
