5

I have the following dataframe and would like to print the unique values of the colors column.

df = pd.DataFrame({'colors': ['green', 'green', 'purple', ['yellow , red'], 'orange'], 'names': ['Terry', 'Nor', 'Franck', 'Pete', 'Agnes']})

Output:
           colors   names
0           green   Terry
1           green     Nor
2          purple  Franck
3  [yellow , red]    Pete
4          orange   Agnes

df.colors.unique() would work fine if it weren't for the [yellow , red] row. As it is, I keep getting TypeError: unhashable type: 'list', which is understandable.

Is there a way to still get the unique values without taking this row into account?

I tried the following, but none of it worked:

df = df[~df.colors.str.contains(',', na=False)] # Nothing happens
df = df[~df.colors.str.contains('[', na=False)] # Output: error: unterminated character set at position 0
df = df[~df.colors.str.contains(']', na=False)] # Nothing happens
6
  • Ideally this should work, df.loc[~df.colors.str.contains('[', na=False, regex=False), 'colors'].unique() Commented Oct 17, 2019 at 13:40
  • The above code returns ['green', 'purple', 'orange'] Commented Oct 17, 2019 at 13:42
  • @I.M. do you actually want the values inside the list also if they are unique or you want to ignore them? Commented Oct 17, 2019 at 13:42
  • For some reason I also get the error: unterminated character set at position 0 @MahendraSingh Commented Oct 17, 2019 at 13:42
  • @vb_rises I could do with ignoring them however the ideal would be to have the unique values of the column even when they are in a list format. Commented Oct 17, 2019 at 13:44

4 Answers

3

If the values are lists, test for them with isinstance:

#changed sample data
df = pd.DataFrame({'colors': ['green', 'green', 'purple', ['yellow' , 'red'], 'orange'], 
                   'names': ['Terry', 'Nor', 'Franck', 'Pete', 'Agnes']})

df = df[~df.colors.map(lambda x : isinstance(x, list))]
print (df)
   colors   names
0   green   Terry
1   green     Nor
2  purple  Franck
4  orange   Agnes

Your solution works if you cast the column to strings and pass regex=False:

df = df[~df.colors.astype(str).str.contains('[', na=False, regex=False)] 
print (df)
   colors   names
0   green   Terry
1   green     Nor
2  purple  Franck
4  orange   Agnes
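
As a side note on why the original attempt failed: str.contains treats its pattern as a regular expression by default, and a lone '[' opens a character set that is never closed. A minimal sketch using only the standard re module (not pandas itself) reproduces the message. re.escape is shown as a possible alternative to regex=False, on the assumption that you want to keep regex matching switched on:

```python
import re

# A bare '[' starts a character set that never ends, so compilation fails.
try:
    re.compile('[')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Escaping the bracket turns it into a literal character in the pattern.
literal_bracket = re.escape('[')
found = re.search(literal_bracket, "['yellow', 'red']") is not None
print(literal_bracket, found)
```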

If you also want the unique values from inside the lists (pandas 0.25+):

s = df.colors.map(lambda x : x if isinstance(x, list) else [x]).explode().unique().tolist()
print (s)
['green', 'purple', 'yellow', 'red', 'orange']

2

You can also check the type:

df.colors.apply(lambda x : type(x)!=list)
0     True
1     True
2     True
3    False
4     True
Name: colors, dtype: bool
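
To go from that boolean mask to the unique values the question asks for, something like the following should work. This is a sketch on the changed sample data from the first answer (a real two-element list); the names is_scalar and unique_colors are mine, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({
    'colors': ['green', 'green', 'purple', ['yellow', 'red'], 'orange'],
    'names': ['Terry', 'Nor', 'Franck', 'Pete', 'Agnes'],
})

# True for every row whose value is not a list.
is_scalar = df.colors.apply(lambda x: type(x) != list)

# Keep only the hashable rows, then ask for the unique values.
unique_colors = df.loc[is_scalar, 'colors'].unique().tolist()
print(unique_colors)
```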


1

Assuming every value in your dataframe is important, here's a technique I frequently use to "unpack lists":

import re

def unlock_list_from_string(string, delim=','):
    """
    lists are stored as strings (in csv files) ex. '[1,2,3]'
    this function unlocks that list
    """
    if not isinstance(string, str):
        return string

    # remove brackets (raw string avoids invalid-escape warnings)
    clean_string = re.sub(r'\[|\]', '', string)
    unlocked_string = clean_string.split(delim)
    unlocked_list = [x.strip() for x in unlocked_string]
    return unlocked_list

all_colors_nested = df['colors'].apply(unlock_list_from_string)
# unnest
all_colors = [x for y in all_colors_nested for x in y ]

print(all_colors)
# ['green', 'green', 'purple', 'yellow', 'red', 'orange']
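
The flattened list still contains duplicates, so one more step is needed to get unique values. A minimal sketch, assuming you want to keep the order of first appearance (the variable unique_colors is my own name, and the input is simply the printed result from above):

```python
all_colors = ['green', 'green', 'purple', 'yellow', 'red', 'orange']

# dict keys are unique and keep insertion order, so this drops
# duplicates without reordering the remaining values.
unique_colors = list(dict.fromkeys(all_colors))
print(unique_colors)
```

A plain set(all_colors) would also remove the duplicates, but it does not preserve the order.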


4 Comments

Your method seems very interesting and works really well here, but I tried it on the dataframe I'm actually working with (which is very big) and it unfortunately fails. I'll keep it for more 'normal'-sized dataframes though.
What's the error you're receiving? (I use this solution on large dataframes too)
The following one: IOPub data rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable `--NotebookApp.iopub_data_rate_limit`.
Ah, your dataframe is very very big. You might consider operating in chunks.
1

Changed input sample

The input in the question holds the list as a string (as specified by the poster), so it is converted into a list of strings here.

# Required Import
from ast import literal_eval

df = pd.DataFrame({
    'colors': ['green', 'green', 'purple', "['yellow' , 'red']", 'orange'], 
    'names': ['Terry', 'Nor', 'Franck', 'Pete', 'Agnes']
})

Apply literal_eval to convert the string into an actual list, but only in the rows where the value is a list stored as a string. For more info, check out the literal_eval documentation.

list_records = df.colors.str.contains('[', na=False, regex=False)
df.loc[list_records, 'colors'] = df.loc[list_records, 'colors'].apply(literal_eval)
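
For illustration, here is what literal_eval does on individual values, without pandas. The sample list raw_values and the startswith('[') check are assumptions made for this sketch; the answer itself uses str.contains to pick the rows. Calling literal_eval on a bare word such as green raises ValueError, which is why only the bracketed rows are converted:

```python
from ast import literal_eval

raw_values = ['green', "['yellow' , 'red']", 'orange']

# Only evaluate the values that look like a list literal.
parsed = [literal_eval(v) if v.startswith('[') else v for v in raw_values]
print(parsed)

# A bare word is not a Python literal, so literal_eval rejects it.
try:
    literal_eval('green')
    bare_word_ok = True
except ValueError:
    bare_word_ok = False
print(bare_word_ok)
```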

Unique Colors

Works with pandas >= 0.25

df.explode('colors')['colors'].unique().tolist()

Gives

['green', 'purple', 'yellow', 'red', 'orange']
