I have a pandas DataFrame and I am trying to remove duplicate rows if the LE column is "AAA". If there is an "AAA" row but no other rows with the same ID/Name, then I want to leave the row alone.

What I have

import pandas as pd

df = pd.DataFrame({'ID': [111, 222, 222, 333, 333, 444, 444, 444, 555, 555, 555, 555],
                   'Name': ['David','Carl','Carl','Jane','Jane','Mike','Mike','Mike','Jake','Jake','Jake','Jake'],
                   'LE': ['AAA','AAA','BBB','BBB','CCC','AAA','BBB','CCC','AAA','BBB','CCC','DDD']})

print(df)

     ID   Name   LE
0   111  David  AAA
1   222   Carl  AAA
2   222   Carl  BBB
3   333   Jane  BBB
4   333   Jane  CCC
5   444   Mike  AAA
6   444   Mike  BBB
7   444   Mike  CCC
8   555   Jake  AAA
9   555   Jake  BBB
10  555   Jake  CCC
11  555   Jake  DDD

What I want


    ID   Name   LE
0  111  David  AAA
1  222   Carl  BBB
2  333   Jane  BBB
3  333   Jane  CCC
4  444   Mike  BBB
5  444   Mike  CCC
6  555   Jake  BBB
7  555   Jake  CCC
8  555   Jake  DDD

In this case, the row with "David" is left alone as there are no other instances of "David."

The row with "Jane" is left alone as there are no instances of "AAA" under the LE column.

For the rest, all rows with "AAA" under the LE column are deleted, as the other two columns contain duplicate data.

I tried using drop_duplicates(), but it doesn't work here because it keeps only one row from each set of duplicates. In this case, I want to delete only one specific row from each set.
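
For reference, a minimal sketch of the drop_duplicates() behaviour described above, assuming duplicates are judged on the ID and Name columns. The variable names first and none_kept are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [111, 222, 222, 333, 333, 444, 444, 444, 555, 555, 555, 555],
    'Name': ['David', 'Carl', 'Carl', 'Jane', 'Jane', 'Mike', 'Mike', 'Mike',
             'Jake', 'Jake', 'Jake', 'Jake'],
    'LE': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'BBB', 'CCC',
           'AAA', 'BBB', 'CCC', 'DDD'],
})

# keep='first' retains exactly one row per ID/Name pair,
# which for Carl, Mike and Jake is the unwanted 'AAA' row.
first = df.drop_duplicates(subset=['ID', 'Name'], keep='first')

# keep=False discards every row whose ID/Name pair is repeated,
# so only David's row survives.
none_kept = df.drop_duplicates(subset=['ID', 'Name'], keep=False)

print(first)
print(none_kept)
```

Neither setting removes just the 'AAA' row while keeping the remaining rows of each group.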

tl;dr Delete duplicate rows only if the LE column has the value 'AAA'

  • Instead of images, put editable dataframe text. It would be easy to take your data and provide answers. Commented Aug 20, 2020 at 4:51
  • Thank you, made the adjustments. Commented Aug 20, 2020 at 18:21

2 Answers

1

Here's a one-liner -

The first mask, df.duplicated(['ID', 'Name'], keep=False), is True for every row whose ID/Name pair appears more than once. The second mask is True where LE is 'AAA'. Rows where both are True are the ones to delete, so the negation of their & is used to boolean-index df. Lastly, reset the index and drop the old one.

df[~(df.duplicated(['ID', 'Name'], keep=False) & (df['LE']=='AAA'))].reset_index(drop=True)
    ID   Name   LE
0  111  David  AAA
1  222   Carl  BBB
2  333   Jane  BBB
3  333   Jane  CCC
4  444   Mike  BBB
5  444   Mike  CCC
6  555   Jake  BBB
7  555   Jake  CCC
8  555   Jake  DDD
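
A minimal sketch of this boolean-mask approach as a reusable function, assuming a "duplicate" means a repeated ID/Name pair. The name drop_aaa_duplicates is hypothetical, and the sample is a reordered subset of the question's data, chosen to check that the result does not depend on which 'AAA' row comes first.

```python
import pandas as pd


def drop_aaa_duplicates(frame):
    """Drop 'AAA' rows whose ID/Name pair also appears on another row."""
    # True for every row whose ID/Name pair occurs more than once.
    repeated = frame.duplicated(['ID', 'Name'], keep=False)
    # True where the LE column holds 'AAA'.
    is_aaa = frame['LE'].eq('AAA')
    # Keep everything except rows that are both repeated and 'AAA'.
    return frame[~(repeated & is_aaa)].reset_index(drop=True)


# Subset of the question's data, ordered so David's 'AAA' row is not first.
df = pd.DataFrame({
    'ID': [222, 222, 111, 333, 333],
    'Name': ['Carl', 'Carl', 'David', 'Jane', 'Jane'],
    'LE': ['AAA', 'BBB', 'AAA', 'BBB', 'CCC'],
})

result = drop_aaa_duplicates(df)
print(result)
```

Here Carl's 'AAA' row is removed while David's lone 'AAA' row stays, because the duplicate check looks at ID/Name rather than at the LE values.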

1 Comment

This is the most pythonic way in my opinion
0

I used a counts_dictionary to count how many times each name occurs. Since each name maps to a single ID, there was no need to count the IDs as well. Then I iterated over all the rows in the DataFrame and dropped any row whose name count was greater than 1 and whose LE value was 'AAA'.

# Count how many rows each name has.
counts_dictionary = {}
for index, row in df.iterrows():
    counts_dictionary[row['Name']] = counts_dictionary.get(row['Name'], 0) + 1

# Collect the 'AAA' rows whose name appears more than once.
to_drop = [index for index, row in df.iterrows()
           if row['LE'] == 'AAA' and counts_dictionary[row['Name']] > 1]

# Drop them in one call instead of mutating df while iterating over it.
df = df.drop(to_drop).reset_index(drop=True)
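
The same count-then-filter idea can probably be sketched without explicit loops. This version assumes, as above, that counting by Name is enough to identify duplicates; name_counts and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [111, 222, 222, 333, 333, 444, 444, 444, 555, 555, 555, 555],
    'Name': ['David', 'Carl', 'Carl', 'Jane', 'Jane', 'Mike', 'Mike', 'Mike',
             'Jake', 'Jake', 'Jake', 'Jake'],
    'LE': ['AAA', 'AAA', 'BBB', 'BBB', 'CCC', 'AAA', 'BBB', 'CCC',
           'AAA', 'BBB', 'CCC', 'DDD'],
})

# For each row, the number of rows sharing its Name (aligned to df's index).
name_counts = df.groupby('Name')['Name'].transform('size')

# Keep a row unless it is 'AAA' and its Name occurs more than once.
result = df[~((df['LE'] == 'AAA') & (name_counts > 1))].reset_index(drop=True)
print(result)
```

On larger frames this should avoid the per-row overhead of iterrows(), at the cost of being a little less explicit about the counting step.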
