I am writing Python code to drop specific rows based on the 'desc' column, depending on the value in the 'label' column. I am focusing on two labels: 'L1' and 'arc'.

Some L1 rows have the same 'desc' as arc rows. For each such L1 row, I want to rename the label to 'L1arc' and drop the arc row, since it is a duplicate. I do not want to remove every duplicate in the 'desc' column, though: rows that share a description but carry other labels should stay.

The dataframe looks like below:

    desc                             label  lang
0   The sky is blue                  L1     en
1   Design tech                      L2     en
2   Design tech                      L3     en
3   Silverline clouds                PM     en
4   No event data                    L1     en
5   TouchStatus shall be calculated  L1     en
6   160 fps                          arc    en
7   Failure detection specified      L1     en
8   160 fps                          L1     en
9   No event data                    arc    en
10  Design tech                      L1     en

Here is the code I tried:

sample.sort_values('label', ascending=False).drop_duplicates('desc').sort_index()

The problem is that the code above also removes the duplicates for other labels (L2 and L3), which I want to retain, along with L1. How can I remove only specific duplicates in a column?

Expected output:

    desc                             label
0   The sky is blue                  L1
1   Design tech                      L2
2   Design tech                      L3
3   Design tech                      L1
4   Silverline clouds                PM
5   No event data                    L1arc
6   TouchStatus shall be calculated  L1
7   Failure detection specified      L1
8   160 fps                          L1arc
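
To illustrate why the attempt over-removes, here is a small sketch on a reduced, made-up subset of the rows (the five-row `sample` frame below is an assumption for illustration, not the full dataset). `drop_duplicates('desc')` keeps exactly one row per distinct description and never looks at the label, so the L2/L3/L1 'Design tech' rows collapse into one.

```python
import pandas as pd

# Hypothetical reduced sample: three 'Design tech' rows and one L1/arc pair.
sample = pd.DataFrame({
    'desc': ['Design tech', 'Design tech', '160 fps', '160 fps', 'Design tech'],
    'label': ['L2', 'L3', 'arc', 'L1', 'L1'],
})

# Same chain as in the question: one surviving row per distinct desc.
deduped = (
    sample.sort_values('label', ascending=False)
          .drop_duplicates('desc')
          .sort_index()
)
print(deduped)
```

Only two rows survive here, one per description, which is why a filter that also takes the label into account seems to be needed.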
  • Can you add the constraint (all L1 rows should be retained) to your question? (I can't edit it myself since the edit queue is full.) You commented on my solution that all L1 rows should be retained. Commented Jan 13, 2021 at 8:29
  • Yup, the question is clearer now. Thanks for the suggestion. Commented Jan 14, 2021 at 4:11

1 Answer

import pandas as pd

list_row_data = [
    ['The sky is blue', 'L1', 'en'],
    ['Design tech', 'L2', 'en'],
    ['Design tech', 'L3', 'en'],
    ['Silverline clouds', 'PM', 'en'],
    ['No event data', 'L1', 'en'],
    ['TochStatus shall be calculated', 'L1', 'en'],
    ['160 fps', 'arc', 'en'],
    ['Failure detection specified', 'L1', 'en'],
    ['160 fps', 'L1', 'en'],
    ['No event data', 'arc', 'en'],
    ['Design tech', 'L1', 'en'],
]

df = pd.DataFrame(list_row_data, columns=['desc', 'label', 'lang'])

# Rows to drop: the desc is duplicated and the label is not one of L1, L2, L3.
df_ignore_row = df[
    (df.duplicated(subset=['desc'], keep=False)) &
    (~df['label'].isin(['L1', 'L2', 'L3']))  # all L1, L2 and L3 rows should be retained
]

for idx, (desc, label, lang) in df_ignore_row.iterrows():
    # Append this row's label to the label of every row with the same desc
    # (e.g. 'L1' -> 'L1arc'); rows with a different desc are left unchanged.
    df.label = df[['desc', 'label']].apply(lambda series: series.label if series.desc != desc else series.label+label, axis=1)

df = df.drop(df_ignore_row.index)
df_focus = df[['desc', 'label']]  # Focus on the columns you are interested in.
print(df_focus.sort_values('desc', ascending=False).reset_index(drop=True))

Result:

                             desc  label
0  TochStatus shall be calculated     L1
1                 The sky is blue     L1
2               Silverline clouds     PM
3                   No event data  L1arc
4     Failure detection specified     L1
5                     Design tech     L2
6                     Design tech     L3
7                     Design tech     L1
8                         160 fps  L1arc
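
As a possible alternative to the row-by-row loop, here is a minimal sketch that uses boolean masks instead. It assumes the goal is exactly what the question describes: rename an L1 row to 'L1arc' when an arc row shares its desc, drop that arc row, and leave every other row alone. The names `is_l1`, `is_arc`, `rename_mask`, `drop_mask` and `out` are illustrative, and the data uses the question's spelling of 'TouchStatus'.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ['The sky is blue', 'L1', 'en'],
        ['Design tech', 'L2', 'en'],
        ['Design tech', 'L3', 'en'],
        ['Silverline clouds', 'PM', 'en'],
        ['No event data', 'L1', 'en'],
        ['TouchStatus shall be calculated', 'L1', 'en'],
        ['160 fps', 'arc', 'en'],
        ['Failure detection specified', 'L1', 'en'],
        ['160 fps', 'L1', 'en'],
        ['No event data', 'arc', 'en'],
        ['Design tech', 'L1', 'en'],
    ],
    columns=['desc', 'label', 'lang'],
)

is_l1 = df['label'].eq('L1')
is_arc = df['label'].eq('arc')

# L1 rows whose desc also appears on some arc row.
rename_mask = is_l1 & df['desc'].isin(df.loc[is_arc, 'desc'])
# arc rows whose desc also appears on some L1 row.
drop_mask = is_arc & df['desc'].isin(df.loc[is_l1, 'desc'])

out = df.copy()
out.loc[rename_mask, 'label'] = 'L1arc'
out = out[~drop_mask].reset_index(drop=True)
print(out[['desc', 'label']])
```

Because only the L1 and arc masks are involved, duplicates among other labels (such as the three 'Design tech' rows) are never touched. The rows keep their original order in this sketch, so a further sort would be needed if a particular ordering matters.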

5 Comments

I get this error: TypeError: <lambda>() got an unexpected keyword argument 'axis'. I am not able to pass the 'axis' argument to the lambda.
It works on my machine (although there is a warning). I changed the way I wrote it; the warning is gone on my machine and it still works. Please try again and check whether it works for you. If not, please provide your Python and pandas versions.
Thanks, Carson. My dataset is a little more complicated. I have edited the question: I inserted a row (index 3) with label L1. There are also several more columns in the dataset, which I reflected here by adding one more column, 'lang'. So I only need to identify L1 rows whose desc matches an arc row, rename L1 -> L1arc, and then delete the 'arc' rows. All other labels, including L1, should be retained.
My Python version is 3.8.3 and my pandas version is 1.0.5.
I have edited the answer, and now the result looks like what you expect. I am using Python 3.7.3 and pandas 1.2.0, but I don't think the Python version matters. If you still can't run it, update your pandas version and it should run successfully.
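
For anyone comparing environments as in the comments above, a short sketch for printing the two versions that were asked about:

```python
import sys
import pandas as pd

# Report the interpreter and pandas versions in use.
python_version = sys.version.split()[0]
pandas_version = pd.__version__
print('python', python_version)
print('pandas', pandas_version)
```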
