11

I have a dataframe:

id      rev     names
34e     A      su,ra,ve,ra,de,ra
45e     R      ra,su,su,ve,de
55e     G      su,ra,de
41e     M      su,de,mu,er,su

Now I need to delete the duplicates within each names value. The output should look like this:

id      rev     names
34e     A      su,ra,ve,de
45e     R      ra,su,ve,de
55e     G      su,ra,de
41e     M      su,de,mu,er

How can this be done?

Is names a string or a list? (Comment, Dec 3, 2018 at 14:56)

3 Answers

17

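If the column contains strings, split each one on the comma, convert the pieces to a set to drop duplicates, and join the result back (note that a set does not preserve the original order):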

df['names'] = df['names'].apply(lambda x: ','.join(set(x.split(','))))

If the column contains lists, convert each list to a set and back to a list:

df['names'] = df['names'].apply(lambda x: list(set(x)))

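If the order matters, use pandas.unique, which keeps elements in order of first appearance (the first line is for strings, the second for lists):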

df['names'] = df['names'].apply(lambda x: ','.join(pd.unique(x.split(','))))

df['names'] = df['names'].apply(lambda x: list(pd.unique(x)))
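
As a possible alternative to pd.unique, the order-preserving case can also be sketched with the standard library only. This assumes names holds comma-separated strings and relies on dict keys keeping insertion order (Python 3.7+). The helper name dedupe_keep_order is hypothetical and not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question; names holds comma-separated strings.
df = pd.DataFrame({
    "id": ["34e", "45e", "55e", "41e"],
    "rev": ["A", "R", "G", "M"],
    "names": ["su,ra,ve,ra,de,ra", "ra,su,su,ve,de", "su,ra,de", "su,de,mu,er,su"],
})

def dedupe_keep_order(value):
    # Hypothetical helper: dict keys are unique and keep insertion order
    # (Python 3.7+), so the first occurrence of each name is retained.
    return ",".join(dict.fromkeys(value.split(",")))

df["names"] = df["names"].apply(dedupe_keep_order)
print(df)
```

If the column holds lists instead, list(dict.fromkeys(x)) should work the same way without the split and join.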

1 Comment

If the order matters, pd.unique looks good enough for this type of question :-)
2

Use split, then sorted + set with key=x.index to restore the original order, then join the result back into a string:

df.names.str.split(',').map(lambda x : ','.join(sorted(set(x),key=x.index)))
Out[763]: 
0    su,ra,ve,de
1    ra,su,ve,de
2       su,ra,de
3    su,de,mu,er
Name: names, dtype: object
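
The expression above returns a new Series without modifying df. A minimal sketch of the same idea, written out step by step and assigned back to the column, might look like this. The helper name unique_in_first_seen_order is hypothetical, and the DataFrame is rebuilt from the question's sample data.

```python
import pandas as pd

# Only the names column from the question is needed for this sketch.
df = pd.DataFrame({
    "names": ["su,ra,ve,ra,de,ra", "ra,su,su,ve,de", "su,ra,de", "su,de,mu,er,su"],
})

def unique_in_first_seen_order(items):
    # Hypothetical helper: set() drops duplicates, then sorting by
    # items.index (position of the first occurrence) restores the order.
    # list.index is a linear scan, so this is quadratic in len(items).
    return sorted(set(items), key=items.index)

# str.split yields a list per row; join the de-duplicated list back to a string.
df["names"] = df["names"].str.split(",").map(
    lambda x: ",".join(unique_in_first_seen_order(x))
)
print(df["names"])
```

For short lists like these the quadratic cost is negligible; for very long lists an approach that tracks seen items in a single pass would likely scale better.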


0

Assuming names is of type string:

import pandas as pd

data = [['34e', 'A', 'su,ra,ve,ra,de,ra'],
        ['45e', 'R', 'ra,su,su,ve,de'],
        ['55e', 'G', 'su,ra,de'],
        ['41e', 'M', 'su,de,mu,er,su']]

df = pd.DataFrame(data=data, columns=['id', 'rev', 'names'])

df['names'] = [','.join(set(name.split(','))) for name in df.names]
print(df)

Or if of type list:

import pandas as pd

data = [['34e', 'A', ['su', 'ra', 've', 'ra', 'de', 'ra']],
        ['45e', 'R', ['ra', 'su', 'su', 've', 'de']],
        ['55e', 'G', ['su', 'ra', 'de']],
        ['41e', 'M', ['su', 'de', 'mu', 'er', 'su']]]

df = pd.DataFrame(data=data, columns=['id', 'rev', 'names'])

df['names'] = [list(set(name)) for name in df.names]
print(df)

Output (for the list version; the order within each list may differ, because sets are unordered):

    id rev             names
0  34e   A  [su, ra, ve, de]
1  45e   R  [su, ra, ve, de]
2  55e   G      [su, ra, de]
3  41e   M  [su, er, mu, de]
