Python code to drop specific duplications in dataframe

Question

I am writing Python code to drop specific rows with duplicated 'desc' values, depending on the 'label' column. I am focusing on two labels: 'L1' and 'arc'.

Some L1 rows have the same 'desc' as arc rows. For such L1 rows, I want to rename the label to 'L1arc' and drop the arc row, since it is a duplicate. I do not want to remove the other duplicates in the 'desc' column, even though they share a description under different labels.

The dataframe looks like below:

    desc                             label  lang
0   The sky is blue                  L1     en
1   Design tech                      L2     en
2   Design tech                      L3     en
3   Silverline clouds                PM     en
4   No event data                    L1     en
5   TouchStatus shall be calculated  L1     en
6   160 fps                          arc    en
7   Failure detection specified      L1     en
8   160 fps                          L1     en
9   No event data                    arc    en
10  Design tech                      L1     en

Here is the code I tried:

sample.sort_values('label', ascending=False).drop_duplicates('desc').sort_index()

The problem is that the code above also removes the duplicates for the other labels, L2 and L3, which I want to retain, along with L1. How can I remove only specific duplicates in a column?
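To make the problem concrete, here is a minimal reproduction. The construction of `sample` is an assumption reconstructed from the table above, and the `attempt` and `lost` names are only illustrative.

```python
import pandas as pd

# Assumed reconstruction of `sample` from the table in the question.
rows = [
    ('The sky is blue', 'L1', 'en'),
    ('Design tech', 'L2', 'en'),
    ('Design tech', 'L3', 'en'),
    ('Silverline clouds', 'PM', 'en'),
    ('No event data', 'L1', 'en'),
    ('TouchStatus shall be calculated', 'L1', 'en'),
    ('160 fps', 'arc', 'en'),
    ('Failure detection specified', 'L1', 'en'),
    ('160 fps', 'L1', 'en'),
    ('No event data', 'arc', 'en'),
    ('Design tech', 'L1', 'en'),
]
sample = pd.DataFrame(rows, columns=['desc', 'label', 'lang'])

# The attempted one-liner: it keeps a single row per description,
# namely the one whose label sorts highest.
attempt = (
    sample.sort_values('label', ascending=False)
    .drop_duplicates('desc')
    .sort_index()
)

# Rows that were removed by the attempt.
lost = sample.drop(attempt.index)

print(attempt[['desc', 'label']])
print(lost[['desc', 'label']])
```

On this data the attempt keeps both 'arc' rows and drops their L1 counterparts, and it also drops the L2 and L1 'Design tech' rows, which is not what is wanted.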

Expected output:

    desc                             label
0   The sky is blue                  L1
1   Design tech                      L2
2   Design tech                      L3
3   Design tech                      L1
4   Silverline clouds                PM
5   No event data                    L1arc
6   TouchStatus shall be calculated  L1
7   Failure detection specified      L1
8   160 fps                          L1arc

Can you add the constraint (that L1 rows should also be retained) to your question? I can't edit it myself because the edit queue is full. You mentioned in a comment on my solution that all L1 rows should be retained.
– Carson, Commented Jan 13, 2021 at 8:29

Carson · Accepted Answer · 2021-01-13 08:32:35Z

import pandas as pd

list_row_data = [
    ['The sky is blue', 'L1', 'en'],
    ['Design tech', 'L2', 'en'],
    ['Design tech', 'L3', 'en'],
    ['Silverline clouds', 'PM', 'en'],
    ['No event data', 'L1', 'en'],
    ['TouchStatus shall be calculated', 'L1', 'en'],
    ['160 fps', 'arc', 'en'],
    ['Failure detection specified', 'L1', 'en'],
    ['160 fps', 'L1', 'en'],
    ['No event data', 'arc', 'en'],
    ['Design tech', 'L1', 'en'],
]

df = pd.DataFrame(list_row_data, columns=['desc', 'label', 'lang'])

# Rows to drop: the description is duplicated AND the label is not one of
# L1, L2, L3 (because you want all L1, L2 and L3 rows to be retained).
df_ignore_row = df[
    (df.duplicated(subset=['desc'], keep=False)) &
    (~df['label'].isin(['L1', 'L2', 'L3']))
]

for idx, (desc, label, lang) in df_ignore_row.iterrows():
    # Append the dropped row's label to every row that shares its description
    # (e.g. 'L1' becomes 'L1arc'); all other rows keep their label unchanged.
    df.label = df[['desc', 'label']].apply(lambda series: series.label if series.desc != desc else series.label+label, axis=1)

df = df.drop(df_ignore_row.index)
df_focus = df[['desc', 'label']]  # Focus on the columns you are interested in.
print(df_focus.sort_values('desc', ascending=False).reset_index(drop=True))

Result:

                              desc  label
0  TouchStatus shall be calculated     L1
1                  The sky is blue     L1
2                Silverline clouds     PM
3                    No event data  L1arc
4      Failure detection specified     L1
5                      Design tech     L2
6                      Design tech     L3
7                      Design tech     L1
8                          160 fps  L1arc
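
If only L1/arc pairs should ever be merged, the same result can probably be reached without a row loop by using boolean masks. This is a sketch and not part of the original answer: the names `shared`, `rename_mask` and `drop_mask` are illustrative, and it assumes that an 'arc' row with no matching L1 description should be kept as it is.

```python
import pandas as pd

rows = [
    ('The sky is blue', 'L1', 'en'),
    ('Design tech', 'L2', 'en'),
    ('Design tech', 'L3', 'en'),
    ('Silverline clouds', 'PM', 'en'),
    ('No event data', 'L1', 'en'),
    ('TouchStatus shall be calculated', 'L1', 'en'),
    ('160 fps', 'arc', 'en'),
    ('Failure detection specified', 'L1', 'en'),
    ('160 fps', 'L1', 'en'),
    ('No event data', 'arc', 'en'),
    ('Design tech', 'L1', 'en'),
]
df = pd.DataFrame(rows, columns=['desc', 'label', 'lang'])

is_l1 = df['label'].eq('L1')
is_arc = df['label'].eq('arc')

# Descriptions that appear under both an 'L1' row and an 'arc' row.
shared = set(df.loc[is_l1, 'desc']) & set(df.loc[is_arc, 'desc'])
in_shared = df['desc'].isin(shared)

rename_mask = is_l1 & in_shared   # L1 rows to relabel as 'L1arc'
drop_mask = is_arc & in_shared    # arc rows that duplicate an L1 row

out = df.copy()
out.loc[rename_mask, 'label'] = 'L1arc'
out = out.loc[~drop_mask].reset_index(drop=True)

print(out[['desc', 'label']])
```

Unlike the loop above, this variant never touches L2, L3 or PM rows, even if one of them happened to share a description with an 'arc' row, and it keeps the original row order.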

edited Jan 13, 2021 at 8:32

answered Jan 5, 2021 at 8:18


5 Comments

sinG20 Over a year ago

I get this error: TypeError: <lambda>() got an unexpected keyword argument 'axis'. I am not able to pass the 'axis' argument to the lambda.

Carson Over a year ago

It works on my machine (although with a warning). I have changed the way I wrote it; the warning is gone on my machine and it still works. Please try again and check whether it works for you. If not, please provide your Python and pandas versions.

sinG20 Over a year ago

Thanks, Carson. My dataset is a little more complicated. I have edited the question: I inserted row index 3 with label L1. There are also several other columns in the dataset, which I have reflected here by adding one more column, 'lang'. So I only need to identify L1 descriptions that match an arc description, rename L1 to L1arc, and then delete the 'arc' rows. All other labels, including L1, should be retained.

sinG20 Over a year ago

My Python version is 3.8.3 and my pandas version is 1.0.5.

Carson Over a year ago

I have edited the answer, and the result now looks like what you expect. I am using Python 3.7.3 and pandas 1.2.0, but I don't think the Python version matters. If you still can't run it, update your pandas version and it should run successfully.
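
A note on the TypeError reported above: the earlier revision of the answer is not shown here, so this is only a guess, but that message is what pandas raises when `apply` is called on a single column (a Series) with `axis=1`. `Series.apply` forwards extra keyword arguments to the function, so `axis` ends up being passed to the lambda itself, whereas `DataFrame.apply` accepts `axis`. The small frame below is a made-up example.

```python
import pandas as pd

# Made-up two-row frame, just to show the difference between the two calls.
df = pd.DataFrame({'desc': ['160 fps', '160 fps'], 'label': ['arc', 'L1']})

# Series.apply passes unknown keyword arguments on to the function,
# so axis=1 is handed to the lambda, which does not accept it.
try:
    df['label'].apply(lambda value: value + 'arc', axis=1)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# DataFrame.apply does take axis; axis=1 calls the function once per row.
row_labels = df.apply(lambda row: row['label'] + '!', axis=1)

print(error_message)
print(row_labels.tolist())
```

If that is the cause, selecting the columns with double brackets (which gives a DataFrame) before calling `apply(..., axis=1)`, as the current answer does, avoids the error.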
