I have a dataframe like so:

import pandas as pd

data = {'ID': ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']}

df = pd.DataFrame(data)
df


ID
nan, -1
647, 47
603, 603
6036299, 6036299

How can I create a new column that shows only the unique values in column ID?

Output:

 ID                       unique
nan, -1                    nan, -1
647, 47                    647, 47
603, 603                   603
6036299, 6036299           6036299

I have tried df['unique'] = df.ID.unique() and df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']], but neither works.

  • If 6036299, 6036299 is changed to 6036299, 6036299, 47, is the expected output 6036299 or 6036299, 47? Commented Jan 20, 2020 at 11:54

2 Answers

If order is not important, your second solution works fine:

df['unique'] = [', '.join(set(x.split(', '))) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  -1, nan
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299

If order is important, use dict.fromkeys to remove the duplicates:

df['unique'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
print (df)
                 ID   unique
0           nan, -1  nan, -1
1           647, 47  647, 47
2          603, 603      603
3  6036299, 6036299  6036299
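
To see why dict.fromkeys keeps the original order, here is a minimal sketch in plain Python. The helper name dedupe_keep_order and its sep parameter are made up for illustration; the sketch relies on dicts preserving insertion order, which the language guarantees from Python 3.7.

```python
def dedupe_keep_order(cell, sep=', '):
    """Return `cell` with repeated items removed, keeping first occurrences.

    Hypothetical helper for illustration; not part of pandas.
    """
    # dict keys are unique and keep insertion order (Python 3.7+)
    return sep.join(dict.fromkeys(cell.split(sep)))

ids = ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']
unique = [dedupe_keep_order(x) for x in ids]
print(unique)
# ['nan, -1', '647, 47', '603', '6036299']
```

The same function could be applied to the column with df['ID'].map(dedupe_keep_order).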

If you want to remove duplicates across all values in the column, it is more complicated: split the values, reshape with stack, remove the duplicates, and join the groups back:

data = {'ID':['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']} 

df = pd.DataFrame(data)

df['unique11'] = [', '.join(set(x.split(', '))) for x in df['ID']]
df['unique12'] = [', '.join(dict.fromkeys(x.split(', ')).keys()) for x in df['ID']]
df['unique2'] = (df['ID'].str.split(', ', expand=True)
                        .stack()
                        .drop_duplicates()
                        .groupby(level=0)
                        .agg(', '.join))
print (df)

                     ID     unique11     unique12  unique2
0               nan, -1      -1, nan      nan, -1  nan, -1
1               647, 47      647, 47      647, 47  647, 47
2              603, 603          603          603      603
3  6036299, 6036299, 47  47, 6036299  6036299, 47  6036299
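
The unique2 column drops 47 from the last row because 47 already appeared in row 1. A rough plain-Python sketch of that column-wide logic, assuming a running set of values seen so far, could look like this (dedupe_across_rows is a hypothetical name, not a pandas function):

```python
def dedupe_across_rows(cells, sep=', '):
    """Keep each item only the first time it appears in the whole list.

    Hypothetical helper mirroring split -> stack -> drop_duplicates -> join.
    """
    seen = set()          # every item encountered so far, across all rows
    result = []
    for cell in cells:
        kept = []
        for item in cell.split(sep):
            if item not in seen:
                seen.add(item)
                kept.append(item)
        result.append(sep.join(kept))
    return result

ids = ['nan, -1', '647, 47', '603, 603', '6036299, 6036299, 47']
deduped = dedupe_across_rows(ids)
print(deduped)
# ['nan, -1', '647, 47', '603', '6036299']
```

One difference to keep in mind: when every item in a row has already been seen, this sketch returns an empty string for that row, whereas the pandas version leaves NaN there because the row drops out of the groupby result.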

Another option, more verbose and not order-preserving:

# parentheses are required so the method chain can span several lines
df['unique'] = (df.ID
                  .str.strip()
                  .str.split(', ')
                  .apply(set)
                  .apply(lambda x: ', '.join(x)))

                 ID   unique
0           nan, -1  -1, nan
1           647, 47  47, 647
2          603, 603      603
3  6036299, 6036299  6036299
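
Because a set has no defined order, the order of the joined values can differ between runs. If a reproducible order is wanted, one option (an assumption here, not something the answer itself does) is to sort the unique items before joining. The sort is on strings, so it is lexicographic rather than numeric. The helper name dedupe_sorted is hypothetical.

```python
def dedupe_sorted(cell, sep=', '):
    """Drop duplicate items and return them in sorted (string) order."""
    # strip outer whitespace, split, deduplicate with a set, then sort
    return sep.join(sorted(set(cell.strip().split(sep))))

ids = ['nan, -1', '647, 47', '603, 603', '6036299, 6036299']
unique_sorted = [dedupe_sorted(x) for x in ids]
print(unique_sorted)
# ['-1, nan', '47, 647', '603', '6036299']
```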
