0

I am trying to remove duplicates based on the column item_id from a dataframe df.

df :

    date        code             item_id
0   20210325    30893       001 002 003 003 
1   20210325    10030       001 002 003 003 

In this df the item_id is as follows:

These are all item_ids separated by one or more spaces.
    0 ->  "001  002 003 003"  #here there is an extra space after 001, rest is same.  
    1 ->  "001 002 003 003"

I am using the following function to remove the duplicates.

def create_data_file_removed_duplicate_item(packing_data):
    
    print('start removing duplicated item data')
    print('data count before removing duplication: ' + str(len(packing_data)))
    
    # check null
    packing_data = packing_data[~packing_data['item_id'].isnull()]
    
    # sorting item id
    packing_data['item_id_list'] = packing_data['item_id'].str.split('  ').apply(sorted)\
        .apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list]))
   
    # drop duplicate item_id
    packing_data.drop_duplicates(keep='last', inplace=True, subset=['item_id_list'])
    packing_data = packing_data.drop(columns=['item_id_list'])

    # create non duplicate item data file
    print('data count after removing duplication: ' + str(len(packing_data)))
    
    return packing_data

I am unable to remove the duplicates although there rows 0 and 1 have similar item_id.
I have some other cases where this function removes duplicates where the item_id is as follows:

0 ->  "001 002 003 003".  # there is no space after 001. These are all item_ids separated by one or more spaces.
1 ->  "001 002 003 003"

Expected output:

     date        code            item_id
0   20210325    10030       001 002 003 003 

Is there a way where I can remove the duplicates even if the item_id is separated by multiple spaces?

2 Answers 2

2

You can apply a function to the column that will make the item_id "uniform", then can drop_duplicates()

import pandas as pd


df = pd.DataFrame({'date':['20210325','20210325'],
                   'code':['30893','10030'],
                   'item_id':['001 002 003 003','001    002 003 003']})

df['item_id'] = df['item_id'].apply(lambda x: ' '.join(sorted(x.split())).strip())
df = df.drop_duplicates(subset='item_id', keep="last")

Output:

print(df)
       date   code          item_id
1  20210325  10030  001 002 003 003
Sign up to request clarification or add additional context in comments.

3 Comments

The original question had a sorting function. packing_data['item_id'].str.split(' ').apply(sorted)\ .apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list])) Is there a way to use a sort
if 'item_id':['001 002 005 003 003', '001 002 003 003 005']}) How can we sort and these and still have the uniform item_id
@Sam, sorry didn't realize that part of it. I just updated it. It's a simple as a sorted() method to the split list.
2

Create a temporary column that removes the spaces, then drop duplicates based on that column.

import pandas as pd
df = pd.DataFrame({'date': [20210325, 20210325],
                   'code': [30893, 10030],
                   'item_id': ['001  002 003 003', '001 002 003 003']})


df = (df.assign(t=df['item_id'].str.replace(' ', ''))
        .drop_duplicates('t').drop(columns='t'))

print(df)
#       date   code           item_id
#0  20210325  30893  001  002 003 003

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.