Get unique strings from multiple columns in Pandas Dataframe

Question

I have a dataframe like so:

import pandas as pd

data = {'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']}

df = pd.DataFrame(data)
df


ID
nan, -1
647, 47
603, 603
6036299, 6036299

How can I create a new column that shows only the unique values in column ID?

Output:

 ID                       unique
nan, -1                    nan, -1
647, 47                    647, 47
603, 603                   603
6036299, 6036299           6036299

I have tried df['unique'] = df.ID.unique() and df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']], but neither works.

If 6036299, 6036299 is changed to 6036299, 6036299, 47, is the expected output 6036299 or 6036299, 47?
– jezrael, Commented Jan 20, 2020 at 11:54

jezrael · Accepted Answer · 2020-01-20 11:53:36Z

If order is not important, your second solution works fine:

df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  -1, nan
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299
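
A set has no guaranteed iteration order, so the order of the joined values can vary. As a minimal sketch, assuming a reproducible order is wanted even though the original order is not, the set can be passed through sorted() before joining. The sort is lexicographic on the strings, not numeric, so '10' would sort before '9'.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']})

# set() removes repeats within each string; sorted() then fixes the order
# (string comparison, so this is lexicographic rather than numeric).
df['unique'] = [', '.join(sorted(set(x.split(', ')))) for x in df['ID']]
print(df)
```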

If order is important, use dict.fromkeys to remove duplicates while keeping the first occurrence of each value:

df['unique'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  nan, -1
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299
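
The same idea can be wrapped in a small function, which may be easier to reuse and test. This is only a sketch: unique_in_order and its sep parameter are hypothetical names, not part of pandas. It relies on dicts preserving insertion order, which is guaranteed from Python 3.7.

```python
import pandas as pd

def unique_in_order(text, sep=', '):
    """Hypothetical helper: drop repeated items, keeping first-seen order."""
    # dict.fromkeys keeps one key per distinct item, in insertion order;
    # iterating a dict yields its keys, so .keys() is not needed.
    return sep.join(dict.fromkeys(text.split(sep)))

df = pd.DataFrame({'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']})
df['unique'] = df['ID'].map(unique_in_order)
print(df)
```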

If you want to remove duplicates across all values in the column, it is more complicated: split the values, reshape with stack, remove duplicates, and join each group back together:

data = {'ID':['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']} 

df = pd.DataFrame(data)

df['unique11'] = [', '.join(set(x.split(', '))) for x in df['ID']]
df['unique12'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
df['unique2'] = (df['ID'].str.split(', ', expand=True)
                        .stack()
                        .drop_duplicates()
                        .groupby(level=0)
                        .agg(', '.join))
print (df)

                     ID     unique11     unique12  unique2
0               nan, -1      -1, nan      nan, -1  nan, -1
1               647, 47      647, 47      647, 47  647, 47
2              603, 603          603          603      603
3  6036299, 6036299, 47  47, 6036299  6036299, 47  6036299
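
One thing to watch with column-wide de-duplication: a row whose values all appeared in earlier rows has nothing left after drop_duplicates, so it is missing from the grouped result and shows up as NaN after assignment. The sketch below illustrates this with made-up data and, as an assumption about the desired behaviour, fills such rows with an empty string. It swaps in Series.explode for the split(expand=True) and stack step; the rest of the chain is the same.

```python
import pandas as pd

# Row 1 repeats only values already seen in row 0.
df = pd.DataFrame({'ID': ['647, 47', '47, 647', '603, 603']})

unique_all = (df['ID'].str.split(', ')     # one list of strings per row
                      .explode()           # one value per line, original index repeated
                      .drop_duplicates()   # keep the first occurrence in the whole column
                      .groupby(level=0)    # regroup by original row label
                      .agg(', '.join))

# Rows with no surviving values are absent from unique_all; reindex restores them.
df['unique2'] = unique_all.reindex(df.index, fill_value='')
print(df)
```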

sammywemmy · Accepted Answer · 2020-01-20 11:57:51Z

This is another option, though more verbose, and it does not preserve order:

df['unique'] = (df.ID
                  .str.strip()
                  .str.split(', ')
                  .apply(set)
                  .apply(', '.join))

                 ID   unique
0           nan, -1  -1, nan
1           647, 47  47, 647
2          603, 603      603
3  6036299, 6036299  6036299
