1

I am trying to remove the duplicate strings from each list of strings stored in a column of a Pandas DataFrame.

For example, the list value:

[btc, btc, btc]

should become:

[btc]

I have tried multiple methods, but none seem to work because I am unable to access the string values in the list. Any help is much appreciated.

DataFrame:

          dollar_sign  followers_count  \
0                   [btc]            35946
1                   [btc]            35946
2                   [btc]            35946
3                   [nav]            35946
4         [btc, btc, btc]            35946

Accessing the lists of strings in the column:

for row in df_twitter['dollar_sign']:
    print(row)

Output:

[btc]
[btc]
[btc]
[nav]
[btc, btc, btc]

4 Answers

3

From the information given, I believe the OP's column does not actually hold lists of strings, but strings that look like lists.

The OP's print result shows:

[btc]
[btc]
[btc]
[nav]
[btc, btc, btc]

However, if the cells were real lists of strings, printing them would yield:

['btc']
['btc']
['btc']
['nav']
['btc', 'btc', 'btc']

Solution:

import pandas as pd

df = pd.DataFrame({
        'dollar_sign':['[btc]','[btc]','[btc]','[nav]','[btc, btc, btc]'],
        'followers_count':[35946,35946,35946,35946,35946]}
     )

# Strip the brackets, split on ", ", then collapse each list into a set
df.dollar_sign.str[1:-1].str.split(r",\s").map(set)

0    {btc}
1    {btc}
2    {btc}
3    {nav}
4    {btc}
Name: dollar_sign, dtype: object
  • .str[1:-1] removes [ and ].

  • str.split(",\s") splits on a comma followed by a whitespace character. (This assumes the strings use ", " as the delimiter; otherwise you may need "\s*,\s*" or something more sophisticated.)

  • map(set) turns each list into a set.
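If you want to write the result back into the column as lists rather than sets, the same steps can be sketched as below. This is only a sketch under the same assumption as above (each cell is a string such as '[btc, btc, btc]'); the sample frame is rebuilt here so the snippet runs on its own, and sorted() is my own addition to get a predictable order, since sets are unordered.

```python
import pandas as pd

# Sample data: each cell is a *string* that merely looks like a list
df = pd.DataFrame({
    "dollar_sign": ["[btc]", "[btc]", "[btc]", "[nav]", "[btc, btc, btc]"],
    "followers_count": [35946, 35946, 35946, 35946, 35946],
})

df["dollar_sign"] = (
    df["dollar_sign"]
    .str[1:-1]                         # drop the surrounding [ and ]
    .str.split(r",\s*")                # split on a comma plus optional whitespace
    .map(lambda items: sorted(set(items)))  # dedupe, then back to a sorted list
)

print(df["dollar_sign"].tolist())
```

Each cell is now a real Python list, so the last row holds a single 'btc' entry instead of three.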

3

You can use sets. A set will take out the duplicates.

So, as an example, keeping the style of the output:

for row in df_twitter['dollar_sign']:
    print(list(set(row)))

Output:

[btc]
[btc]
[btc]
[nav]
[btc]

4 Comments

I think this is it! Would this update the original dataframe column to these values as well?
No. The other answers to this question show how to modify the column; this is only for display.
This didn't work - it is giving me this: [c, [, b, ], t]
This answer is not wrong. The likely reason you didn't get what you wanted is, as Tai pointed out, that each cell does not hold a real list but a string with [] in it. Otherwise Mangu's code should work well.
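To illustrate the point made in the comments, here is a small sketch in plain Python (no pandas needed) of why set() returns single characters when the cell is a string. The variable names are made up for the example.

```python
real_list = ["btc", "btc", "btc"]   # an actual list of strings
list_like_string = "[btc]"          # a string that only looks like a list

# set() iterates over its argument: list elements vs. individual characters
unique_from_list = set(real_list)
unique_from_string = set(list_like_string)

print(unique_from_list)              # one element: 'btc'
print(sorted(unique_from_string))    # the characters [, ], b, c, t
print(isinstance(list_like_string, str))
```

Checking the type of a single cell, for example with type(df_twitter['dollar_sign'].iloc[0]), should tell you which case you are in.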
2

You can use list with map; applying set to each row keeps only the unique values.

df['dollar_sign']=list(map(set,df['dollar_sign']))
df
Out[1068]: 
  dollar_sign  followers_count
0       {btc}            35946
1       {btc}            35946
2       {btc}            35946
3       {nav}            35946
4       {btc}            35946

This is how I create the df

import pandas as pd

df = pd.DataFrame({
    'dollar_sign': [['btc'], ['btc'], ['btc'], ['nav'], ['btc', 'btc', 'btc']],
    'followers_count': [35946, 35946, 35946, 35946, 35946],
})

2 Comments

It gave me the value {c, [, b, ], t}
it is the same, but still not getting that
0

Simpler, and it turns each set back into a list, so you can stack, unstack, etc.:

df['column_name'] = df['column_name'].apply(set).apply(list)
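One caveat: a set does not keep the original order of the items. If order matters, a possible variant is dict.fromkeys, which keeps the first occurrence of each item (dicts preserve insertion order in Python 3.7+). This is a sketch that assumes the cells are real lists; the sample data and the 'column_name' label are only illustrative.

```python
import pandas as pd

# Illustrative data: each cell is a real Python list
df = pd.DataFrame({
    "column_name": [["btc", "nav", "btc"], ["nav"], ["eth", "btc", "eth"]],
})

# dict.fromkeys keeps the first occurrence of each item, in order
df["column_name"] = df["column_name"].apply(lambda items: list(dict.fromkeys(items)))

print(df["column_name"].tolist())
```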
