How to remove duplicates from a dataframe based on the column with string values

Question

I am trying to remove duplicates based on the column item_id from a dataframe df.

df :

    date        code             item_id
0   20210325    30893       001  002 003 003
1   20210325    10030       001 002 003 003

In this df, each item_id value is a list of item IDs separated by one or more spaces:

    0 ->  "001  002 003 003"  # there is an extra space after 001; the rest is the same
    1 ->  "001 002 003 003"

I am using the following function to remove the duplicates.

def create_data_file_removed_duplicate_item(packing_data):
    
    print('start removing duplicated item data')
    print('data count before removing duplication: ' + str(len(packing_data)))
    
    # check null
    packing_data = packing_data[~packing_data['item_id'].isnull()]
    
    # sorting item id
    packing_data['item_id_list'] = packing_data['item_id'].str.split('  ').apply(sorted)\
        .apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list]))
   
    # drop duplicate item_id
    packing_data.drop_duplicates(keep='last', inplace=True, subset=['item_id_list'])
    packing_data = packing_data.drop(columns=['item_id_list'])

    # create non duplicate item data file
    print('data count after removing duplication: ' + str(len(packing_data)))
    
    return packing_data

The function does not remove the duplicate, although rows 0 and 1 contain the same item_ids.
In other cases the function does remove duplicates, for example when the item_id values are as follows:

0 ->  "001 002 003 003"  # no extra space after 001
1 ->  "001 002 003 003"

Expected output:

     date        code            item_id
0   20210325    10030       001 002 003 003

Is there a way to remove the duplicates even when the item_ids are separated by multiple spaces?

chitown88 · Accepted Answer · 2021-06-16 12:20:47Z

2

You can apply a function to the column that makes each item_id "uniform" (split on whitespace, sort, and rejoin with single spaces), then call drop_duplicates():

import pandas as pd


df = pd.DataFrame({'date':['20210325','20210325'],
                   'code':['30893','10030'],
                   'item_id':['001 002 003 003','001    002 003 003']})

df['item_id'] = df['item_id'].apply(lambda x: ' '.join(sorted(x.split())).strip())
df = df.drop_duplicates(subset='item_id', keep="last")

Output:

print(df)
       date   code          item_id
1  20210325  10030  001 002 003 003
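For context, here is a minimal sketch of why the function in the question likely misses these rows: it splits on a literal two-space string, so a row with single spaces is never split at all, while a row with one double space is split in just one place. The helper names below are hypothetical and only mirror the two approaches on the question's sample values.

```python
# Sample values taken from rows 0 and 1 of the question.
row0 = "001  002 003 003"   # two spaces after 001
row1 = "001 002 003 003"    # single spaces throughout


def key_two_space_split(item_id):
    # Mirrors the question's approach: split on a literal two-space string,
    # sort the pieces, strip spaces inside each piece, and rejoin.
    parts = sorted(item_id.split("  "))
    return "".join(part.replace(" ", "") + " " for part in parts)


def key_whitespace_split(item_id):
    # str.split() with no argument splits on any run of whitespace,
    # so the number of spaces between IDs no longer matters.
    return " ".join(sorted(item_id.split()))


keys_two_space = [key_two_space_split(row0), key_two_space_split(row1)]
keys_whitespace = [key_whitespace_split(row0), key_whitespace_split(row1)]

print(keys_two_space)
print(keys_whitespace)
```

With the two-space split the two rows produce different keys, so drop_duplicates() treats them as distinct; with split() and no argument they produce the same key.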

edited Jun 16, 2021 at 12:20

answered Jun 15, 2021 at 14:29

chitown88


3 Comments

Sam Over a year ago

The original question had a sorting function.

packing_data['item_id'].str.split('  ').apply(sorted)\
    .apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list]))

Is there a way to use a sort here?

Sam Over a year ago

If 'item_id': ['001 002 005 003 003', '001 002 003 003 005'], how can we sort these and still get a uniform item_id?

chitown88 Over a year ago

@Sam, sorry, I didn't realize that was part of it. I just updated the answer. It's as simple as applying sorted() to the split list.

ALollz · Accepted Answer · 2021-06-15 14:29:05Z

2

Create a temporary column that removes the spaces, then drop duplicates based on that column.

import pandas as pd
df = pd.DataFrame({'date': [20210325, 20210325],
                   'code': [30893, 10030],
                   'item_id': ['001  002 003 003', '001 002 003 003']})


df = (df.assign(t=df['item_id'].str.replace(' ', ''))
        .drop_duplicates('t').drop(columns='t'))

print(df)
#       date   code           item_id
#0  20210325  30893  001  002 003 003
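As a possible variation, the temporary key can be built by splitting on whitespace and sorting instead of deleting the spaces. This is only a sketch, not part of the original answer: it keeps the original item_id text, keeps the boundaries between IDs, and also treats the reordered lists from the comments as duplicates. The function name `drop_duplicate_item_ids`, its parameters, and the extra rows (codes 40001 and 40002) are made up for illustration.

```python
import pandas as pd


def drop_duplicate_item_ids(df, column="item_id", keep="last"):
    # Hypothetical helper: drop rows whose item IDs match after
    # normalizing whitespace and order. The original column is left as-is.
    df = df[df[column].notna()]
    # Normalized key: split on any whitespace, sort, rejoin with single spaces.
    key = df[column].apply(lambda s: " ".join(sorted(s.split())))
    # duplicated() marks every repeat except the occurrence selected by `keep`.
    return df[~key.duplicated(keep=keep)]


# Illustrative data: the first two rows come from the question, the last
# two use the item_id values from the comments with invented codes.
df = pd.DataFrame({
    "date": [20210325, 20210325, 20210325, 20210325],
    "code": [30893, 10030, 40001, 40002],
    "item_id": ["001  002 003 003", "001 002 003 003",
                "001 002 005 003 003", "001 002 003 003 005"],
})

result = drop_duplicate_item_ids(df)
print(result)
```

Because `keep="last"` is the default here, the later row of each duplicate pair survives, which matches the expected output in the question; pass `keep="first"` to keep the earlier row instead.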

answered Jun 15, 2021 at 14:29

ALollz

