I am trying to remove duplicates based on the column item_id from a dataframe df.
df :
date code item_id
0 20210325 30893 001 002 003 003
1 20210325 10030 001 002 003 003
In this df the item_id is as follows:
These are all item_ids separated by one or more spaces.
0 -> "001 002 003 003" #here there is an extra space after 001, rest is same.
1 -> "001 002 003 003"
I am using the following function to remove the duplicates.
def create_data_file_removed_duplicate_item(packing_data):
print('start removing duplicated item data')
print('data count before removing duplication: ' + str(len(packing_data)))
# check null
packing_data = packing_data[~packing_data['item_id'].isnull()]
# sorting item id
packing_data['item_id_list'] = packing_data['item_id'].str.split(' ').apply(sorted)\
.apply(lambda item_id_list: ''.join([item_id.replace(' ', '') + ' ' for item_id in item_id_list]))
# drop duplicate item_id
packing_data.drop_duplicates(keep='last', inplace=True, subset=['item_id_list'])
packing_data = packing_data.drop(columns=['item_id_list'])
# create non duplicate item data file
print('data count after removing duplication: ' + str(len(packing_data)))
return packing_data
I am unable to remove the duplicates although there rows 0 and 1 have similar item_id.
I have some other cases where this function removes duplicates where the item_id is as follows:
0 -> "001 002 003 003". # there is no space after 001. These are all item_ids separated by one or more spaces.
1 -> "001 002 003 003"
Expected output:
date code item_id
0 20210325 10030 001 002 003 003
Is there a way where I can remove the duplicates even if the item_id is separated by multiple spaces?