0

I am trying to add a column to the dataframe below that tells me whether a person belongs to the category Green or not. It would just show Y or N, depending on whether the column category contains it for that person. The problem is that the column category holds a plain string in some rows, a list of strings in others, and even a list of lists in others.


import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

How can I check, for each row, whether the column contains the specific 'Green' string?

Thank you.

3 Answers

3

I would not bother flattening the list, just use basic string matching:

df['category'].astype(str).str.contains(r'\bgreen\b')

0     True
1    False
2     True
3     True
Name: category, dtype: bool

Add the word boundary check \b so we don't accidentally match words like "greenery" or "greenwich" which have "green" as part of a larger word.
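
To see what the \b check buys, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration and stand in for stringified category values; the pattern is the one from the answer. If capitalisation varies in the data, re.IGNORECASE covers that here, and str.contains takes case=False for the same purpose.

```python
import re

# Same pattern as in the answer; \b marks a word boundary.
pattern = re.compile(r"\bgreen\b")

# Hypothetical stringified category values (not from the question's data).
samples = ["['green']", "['greenery']", "[['evergreen'], 'red']", "blue"]

matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, False, False, False]

# Case-insensitive variant, in case some rows spell it 'Green'.
ci = re.compile(r"\bgreen\b", re.IGNORECASE)
print(bool(ci.search("['Green']")))  # True
```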


df.assign(has_green=df['category'].astype(str)
                                  .str.contains(r'\bgreen\b')
                                  .map({True: 'Y', False: 'N'}))

      user                          category has_green
0      Bob                  [[green], [red]]         Y
1     Jane                              blue         N
2  Theresa                           [green]         Y
3    Alice  [[yellow, purple], green, brown]         Y

2 Comments

Great answer, but I have a question about the str accessor. I generally know what str does, but in situations like this I get confused. For instance, the first row holds a list of lists, and I guess str accesses the values inside the first list, but green is nested in another list. So shouldn't we add another str to reach the nested lists? I'd appreciate it if you could recommend a source for fully grasping how str works on a Series, or maybe you can explain here what is happening. Thanks!
@ashkangh I converted the list to a string so it doesn't matter what's inside the string anymore - it's just letters.
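
To illustrate the point in the reply above, here is a small plain-Python sketch (no pandas) of what converting each value to a string does. The values are copied from the question; as_text and has_green are names chosen for this example only. It uses a plain substring test rather than the \b pattern, so it shows the idea and is not a drop-in replacement.

```python
# Sample values copied from the question's 'category' column.
values = [[['green'], ['red']], 'blue', ['green'],
          [['yellow', 'purple'], 'green', 'brown']]

# str() renders the whole nested structure as one flat piece of text,
# so the nesting depth no longer matters for the search.
as_text = [str(v) for v in values]
print(as_text[0])  # [['green'], ['red']]

# Plain substring test on the rendered text.
has_green = ['green' in text for text in as_text]
print(has_green)  # [True, False, True, True]
```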
1

You need to use a recursive flatten.

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

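def flatten(x):
    # Wrap a bare value (e.g. the string 'green') in a list so that it is
    # not iterated character by character, which would miss the match
    if not isinstance(x, list):
        return [x]
    rt = []
    for i in x:
        if isinstance(i, list):
            rt.extend(flatten(i))
        else:
            rt.append(i)
    return rt

def is_green(x):
    # True if 'green' appears at any nesting level
    return "green" in flatten(x)

df["has_green"] = df["category"].apply(is_green)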

print(df)
      user                          category  has_green
0      Bob                  [[green], [red]]       True
1     Jane                              blue      False
2  Theresa                           [green]       True
3    Alice  [[yellow, purple], green, brown]       True

4 Comments

I am trying the solution now. The only problem, which I didn't mention before, is that some rows have None. Where would you introduce an else condition, or what would you do if, for example, Jane's category was None? Thank you!
@Rodrigo In is_green() you can add a check for None: if x is None: ... Please let me know if this helps, and accept the answer if it does.
It does, thank you! How can I accept two answers as correct?
You cannot, only choose one.
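
The None check suggested in the comments could look roughly like the sketch below. The names flatten_any and is_green_safe are hypothetical, chosen so they do not clash with the answer's functions, and the sample list is made up for illustration. This is one possible way to do it, not necessarily what the commenters had in mind.

```python
def flatten_any(x):
    """Recursively flatten nested lists; wrap any non-list value in a list."""
    if not isinstance(x, list):
        return [x]
    flat = []
    for item in x:
        flat.extend(flatten_any(item))
    return flat

def is_green_safe(x):
    # Treat a missing category as "not green" instead of raising an error.
    if x is None:
        return False
    return "green" in flatten_any(x)

# Hypothetical sample values: nested list, missing value, flat list, bare string.
categories = [[['green'], ['red']], None, ['green'], 'blue']
print([is_green_safe(c) for c in categories])  # [True, False, True, False]
```

With a dataframe like the one in the answer, this would be applied as df["category"].apply(is_green_safe).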
1

Although I would agree that basic string matching serves the purpose of the question, I would like to draw attention to the fact that flattening lists can be achieved quite easily with pd.core.common.flatten:

import pandas as pd
import ast

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice', 'John'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown'], None]})

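def fix_list(text):
    try:
        if '[' in text:
            # A list stored as a string, e.g. "['green']": parse it back into a list
            text = ast.literal_eval(text)
        else:
            # A real list or a plain string: wrap it so flatten can handle it
            text = [text]
    except (TypeError, ValueError, SyntaxError):
        # None (or unparsable text) becomes an empty list
        text = []
    # Note: pd.core.common.flatten is internal to pandas, not public API,
    # so it may change between versions
    return list(pd.core.common.flatten(text))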
    
df['category'] = df['category'].apply(fix_list)
df['green'] = df['category'].apply(lambda x: 'green' in x)

Result:

      user                                category  green
0      Bob                        ['green', 'red']   True
1     Jane                                ['blue']  False
2  Theresa                               ['green']   True
3    Alice  ['yellow', 'purple', 'green', 'brown']   True
4     John                                      []  False
