Python Pandas Dataframe Remove Duplicate Rows Depending on Column Value

Question

I have a pandas dataframe and I am trying to remove duplicate rows if the LE column is "AAA". If there is an "AAA" but no other rows with same ID/Name, then I want to leave the row(s) alone.

What I have

import pandas as pd

df = pd.DataFrame({'ID': [111, 222, 222, 333, 333, 444, 444, 444, 555, 555, 555, 555], 
                   'Name': ['David','Carl','Carl','Jane','Jane','Mike','Mike','Mike','Jake','Jake','Jake','Jake'],
                  'LE': ['AAA','AAA','BBB','BBB','CCC','AAA','BBB','CCC','AAA','BBB','CCC','DDD']})

print(df)

     ID   Name   LE
0   111  David  AAA
1   222   Carl  AAA
2   222   Carl  BBB
3   333   Jane  BBB
4   333   Jane  CCC
5   444   Mike  AAA
6   444   Mike  BBB
7   444   Mike  CCC
8   555   Jake  AAA
9   555   Jake  BBB
10  555   Jake  CCC
11  555   Jake  DDD

What I want


    ID   Name   LE
0  111  David  AAA
1  222   Carl  BBB
2  333   Jane  BBB
3  333   Jane  CCC
4  444   Mike  BBB
5  444   Mike  CCC
6  555   Jake  BBB
7  555   Jake  CCC
8  555   Jake  DDD

In this case, the row with "David" is left alone as there are no other instances of "David."

The row with "Jane" is left alone as there are no instances of "AAA" under the LE column.

For the rest, all rows with "AAA" under the LE column are deleted, as there is duplicate data in the other two columns.

I tried using drop_duplicates(), but it doesn't work because it keeps only one row from each set of duplicates. In this case, I want to delete only one specific row per set of duplicates.

tl;dr Delete duplicate rows only if the LE column has the value 'AAA'

Instead of images, put editable dataframe text. It would be easy to take your data and provide answers.
– Mohamed Thasin ah, Commented Aug 20, 2020 at 4:51

Akshay Sehgal · Accepted Answer · 2020-08-20 19:13:46Z

1

Here's a one-liner -

The first mask, df.duplicated(['ID', 'Name'], keep=False), is True for every row whose ID/Name pair appears more than once; the second is True where LE is 'AAA'. The negation of their & is then used to boolean-index df, so only rows that are both duplicated and 'AAA' are removed. Lastly, reset the index and drop the old one.

df[~(df.duplicated(['ID', 'Name'], keep=False) & (df['LE'] == 'AAA'))].reset_index(drop=True)

    ID   Name   LE
0  111  David  AAA
1  222   Carl  BBB
2  333   Jane  BBB
3  333   Jane  CCC
4  444   Mike  BBB
5  444   Mike  CCC
6  555   Jake  BBB
7  555   Jake  CCC
8  555   Jake  DDD
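
As a rough, self-contained sketch of the boolean-mask approach: the helper name drop_value_if_duplicated and its parameters are made up for illustration, and it assumes that "duplicate" means sharing both ID and Name with at least one other row. Passing keep=False to duplicated() marks every member of a repeated group, so the result should not depend on row order.

```python
import pandas as pd


def drop_value_if_duplicated(df, keys, column, value):
    """Drop rows where `column` equals `value`, but only when the row's
    `keys` combination also appears in at least one other row.

    Hypothetical helper for illustration; not part of pandas.
    """
    # keep=False flags every row of a repeated key group, not just the later ones
    in_repeated_group = df.duplicated(subset=keys, keep=False)
    has_value = df[column] == value
    return df[~(in_repeated_group & has_value)].reset_index(drop=True)


df = pd.DataFrame({
    'ID': [111, 222, 222, 333, 333, 444, 444, 444, 555, 555, 555, 555],
    'Name': ['David', 'Carl', 'Carl', 'Jane', 'Jane', 'Mike', 'Mike', 'Mike',
             'Jake', 'Jake', 'Jake', 'Jake'],
    'LE': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'BBB', 'CCC',
           'AAA', 'BBB', 'CCC', 'DDD'],
})

result = drop_value_if_duplicated(df, keys=['ID', 'Name'], column='LE', value='AAA')
print(result)

# Reversing the rows first should keep the same set of rows
# (David's lone 'AAA' row survives either way)
reversed_result = drop_value_if_duplicated(
    df.iloc[::-1], keys=['ID', 'Name'], column='LE', value='AAA'
)
print(reversed_result)
```

A mask built on LE alone, e.g. df.duplicated(['LE']), would depend on row order: on the reversed frame it would drop David's row and keep Jake's 'AAA' row.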

edited Aug 20, 2020 at 19:13

answered Aug 20, 2020 at 19:08

Akshay Sehgal


1 Comment

Ravish Jha Over a year ago

This is the most pythonic way in my opinion

Ravish Jha · Accepted Answer · 2020-08-20 19:14:51Z

0

I used a counts_dictionary to count how many times each name occurs; since each ID corresponds to a single name, there was no point in counting the IDs as well. Then I iterated through all the rows in the DataFrame and dropped every row whose name count was greater than 1 and that had AAA in the LE column.

# Count how many rows each name appears in
counts_dictionary = {}
for name in df['Name']:
    counts_dictionary[name] = counts_dictionary.get(name, 0) + 1

# Collect the index labels of 'AAA' rows whose name occurs more than once.
# Collecting first avoids dropping rows from df while iterating over it.
rows_to_drop = [index for index, row in df.iterrows()
                if row['LE'] == 'AAA' and counts_dictionary[row['Name']] > 1]

# Drop them in one call and renumber the index
df = df.drop(rows_to_drop).reset_index(drop=True)
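
The counting step can also be expressed without explicit loops. This is only a minimal sketch, assuming Name alone identifies a person (as in the sample data); it uses groupby(...).transform('size') to attach each name's row count to every row, and the variable names name_counts, to_drop and cleaned are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [111, 222, 222, 333, 333, 444, 444, 444, 555, 555, 555, 555],
    'Name': ['David', 'Carl', 'Carl', 'Jane', 'Jane', 'Mike', 'Mike', 'Mike',
             'Jake', 'Jake', 'Jake', 'Jake'],
    'LE': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'BBB', 'CCC',
           'AAA', 'BBB', 'CCC', 'DDD'],
})

# Number of rows sharing each row's Name, aligned with df's index
name_counts = df.groupby('Name')['Name'].transform('size')

# Mark 'AAA' rows for removal only when the name occurs more than once
to_drop = (df['LE'] == 'AAA') & (name_counts > 1)

cleaned = df[~to_drop].reset_index(drop=True)
print(cleaned)
```

Because the mask is computed for the whole column at once, no rows are dropped while the frame is being iterated.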

edited Aug 20, 2020 at 19:14

answered Aug 20, 2020 at 18:42

Ravish Jha
