1

Running into some difficulty trying to drop correct duplicates from a dataframe.

I have the following example:

import numpy as np
import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, '', 324, '', '', '', 321],}

df = pd.DataFrame(data=test)

The output can be seen below:

                  date value
0  2012-10-12 10:10:10   123
1  2012-10-12 10:10:10      
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
4  2012-11-02 16:08:07      
5  2012-12-12 23:45:21      
6  2012-12-12 23:45:21   321

My desired output after dropping duplicate dates is as shown below:

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321 

However, my attempts to date have been unsuccessful as shown below:

Attempt 1:-

df = df.drop_duplicates(subset='date')

                  date value
0  2012-10-12 10:10:10   123
2  2012-10-19 10:55:10   324
3  2012-11-02 16:08:07      
5  2012-12-12 23:45:21      

Attempt 2:-

df = df.drop_duplicates(subset='date', keep='last')

                  date value
1  2012-10-12 10:10:10      
2  2012-10-19 10:55:10   324
4  2012-11-02 16:08:07      
6  2012-12-12 23:45:21   321

Please can you assist with helping me reach my desired output. Many thanks in advance

1
  • What is the "keep criteria"? I mean, which duplicates remain in the dataframe? the last occurrence? or is there something with value columns? Commented Dec 24, 2020 at 16:28

3 Answers 3

3

One approach is to mask the empty strings in the column value, then groupby on date and aggregate using first:

df['value'].mask(df['value'].eq('')).groupby(df['date']).first().fillna('').reset_index()

Alternatively you can mask the empty strings in the column value and assign it to temporary column key, then sort the dataframe on columns date and key, followed by drop_duplicates:

df['key'] = df['value'].mask(df['value'].eq(''))
df.sort_values(['date', 'key']).drop_duplicates('date').drop('key', 1)

Result:

                  date value
0  2012-10-12 10:10:10   123
1  2012-10-19 10:55:10   324
2  2012-11-02 16:08:07      
3  2012-12-12 23:45:21   321
Sign up to request clarification or add additional context in comments.

1 Comment

the added benefit of having the index reset also, covers an issue I forgot to mention. Great stuff
1
import numpy as np
import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, np.nan, 324,  np.nan,  np.nan,  np.nan, 321],}

This should work out!

df = pd.DataFrame(data=test)
df.sort_values(by = "value", inplace = True)
df = df.drop_duplicates(subset='date')
df = df.replace(np.nan, '', regex=True)
df.sort_index()

Output comes out like below:

        date    value
0   2012-10-12 10:10:10 123
2   2012-10-19 10:55:10 324
3   2012-11-02 16:08:07 
6   2012-12-12 23:45:21 321  

3 Comments

thanks Srishti, however, the order seems to be skewed
Hey @windwalker, just added a sorting statement, to maintain the order, please check the edit, hope it helps!
the df.sort_index() makes it much cleaner now
0
import pandas as pd


test = {'date': ['2012-10-12 10:10:10', '2012-10-12 10:10:10', '2012-10-19 10:55:10', 
        '2012-11-02 16:08:07', '2012-11-02 16:08:07', '2012-12-12 23:45:21', '2012-12-12 23:45:21'],
        'value' : [123, '', 324, '', '', '', 321],}

df = pd.DataFrame(data=test)

df["value_not_empty"] = df['value'].map(bool)
df = df.sort_values("value_not_empty")
df = df.drop(columns=["value_not_empty"])
df = df.drop_duplicates('date', keep='last')
df

enter image description here

1 Comment

Hi Ismael, like the @Srishti Garg solution, the order seems to be skewed but I am grateful for all the help

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.