Python Pandas Dataframe: add new column based on existing column, which contains lists of lists

Question

I am trying to add a column to the dataframe below that tells me whether a person belongs to the category 'green'. It should just show Y or N, depending on whether the category column contains that value for the person. The problem is that the category column holds a plain string in some rows, a list of strings in others, and a list of lists in still others.


import pandas as pd

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

How can I check, for each row, whether the column contains the specific string 'green'?

Thank you.

cs95 · Accepted Answer · 2021-03-08 19:42:56Z

3

I would not bother flattening the list; just use basic string matching:

df['category'].astype(str).str.contains(r'\bgreen\b')

0     True
1    False
2     True
3     True
Name: category, dtype: bool

Add the word boundary check \b so we don't accidentally match words like "greenery" or "greenwich" which have "green" as part of a larger word.
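To see what the boundary check does and does not exclude, here is a minimal sketch using the standard library's re module on strings shaped like the output of astype(str). The sample values are made up for illustration and are not from the question's data.

```python
import re

# Same pattern as in the answer: "green" with a word boundary on each side.
pattern = re.compile(r"\bgreen\b")

# Hypothetical stringified category values.
samples = ["['green']", "['greenery']", "greenwich", "['light green']", "blue"]

matches = {s: bool(pattern.search(s)) for s in samples}
for text, hit in matches.items():
    print(f"{text!r:20} {hit}")
```

Note that "['light green']" still matches, because a space also counts as a word boundary. If multi-word categories like that should not count, flattening and comparing whole items is the safer route.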

df.assign(has_green=df['category'].astype(str)
                                  .str.contains(r'\bgreen\b')
                                  .map({True: 'Y', False: 'N'}))

      user                          category has_green
0      Bob                  [[green], [red]]         Y
1     Jane                              blue         N
2  Theresa                           [green]         Y
3    Alice  [[yellow, purple], green, brown]         Y

answered Mar 8, 2021 at 19:42

cs95


2 Comments

ashkangh Over a year ago

Great answer. But a question about the str accessor. I generally know what the role of str is, but in situations like this I get confused. For instance, the first row holds a list of lists, and I would guess that str accesses the values inside the first list, but green is nested in another list. So shouldn't we add another str to access the nested lists? I'd appreciate it if you could recommend a source for fully grasping how str works on a Series, or maybe you can explain here what is happening. Thanks!

cs95 Over a year ago

@ashkangh I converted the list to a string so it doesn't matter what's inside the string anymore - it's just letters.
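A small illustration of that point, as a sketch: calling str() on a nested list, which is roughly what astype(str) does to each element of an object column, produces one flat piece of text, so the nesting depth no longer matters to the search.

```python
# Bob's category value from the question.
nested = [["green"], ["red"]]

# After conversion there is no list structure left, only characters.
as_text = str(nested)
print(as_text)             # [['green'], ['red']]
print("green" in as_text)  # True
```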

Avi Thaker · Accepted Answer · 2021-03-08 19:33:57Z

1

You need to use a recursive flatten.

import pandas as pd
df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice'], 'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown']]})

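def flatten(x):
    # A bare string is a single category; without this check it would be
    # split into characters and 'green' would never be found.
    if isinstance(x, str):
        return [x]
    rt = []
    for i in x:
        if isinstance(i, list):
            rt.extend(flatten(i))
        else:
            rt.append(i)
    return rt

def is_green(x):
    # True if 'green' appears at any nesting level.
    return "green" in flatten(x)

df["has_green"] = df["category"].apply(is_green)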

print(df)

      user                          category  has_green
0      Bob                  [[green], [red]]       True
1     Jane                              blue      False
2  Theresa                           [green]       True
3    Alice  [[yellow, purple], green, brown]       True

answered Mar 8, 2021 at 19:33

Avi Thaker


4 Comments

Rodrigo Over a year ago

I am trying the solution now. The only problem, which I didn't mention before, is that some rows have None. Where would you add an else condition, or what would you do if, for example, Jane's category was None? Thank you!

Avi Thaker Over a year ago

In is_green(), @Rodrigo, you can add a check for None: if x is None: ... Please let me know if this helps, and accept the answer if it does.

Rodrigo Over a year ago

it does thank you! How can I accept two answers as correct?

Avi Thaker Over a year ago

You cannot, only choose one.
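The None check suggested in the comments above might look like the following. This is only a sketch: it assumes missing values arrive as None, and it also wraps bare strings so that a row holding just 'green' is not split into characters.

```python
def flatten(x):
    """Recursively flatten nested lists into one flat list."""
    flat = []
    for item in x:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

def is_green(x):
    # A missing value cannot contain a category.
    if x is None:
        return False
    # A bare string is one category; wrap it so it stays whole.
    if isinstance(x, str):
        x = [x]
    return "green" in flatten(x)

# Illustrative values, including a None and a bare 'green' string.
categories = [[["green"], ["red"]], "blue", None, "green"]
results = [is_green(c) for c in categories]
print(results)  # [True, False, False, True]
```

With a dataframe, the same function can be passed straight to df["category"].apply(is_green).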

RJ Adriaansen · Accepted Answer · 2021-03-08 19:52:48Z

Although I would agree that basic string matching serves the purpose of the question, I would like to draw attention to the fact that flattening lists can be achieved quite easily with pd.core.common.flatten:

import pandas as pd
import ast

df = pd.DataFrame({'user': ['Bob', 'Jane','Theresa', 'Alice', 'John'], 
                   'category': [[['green'],['red']],'blue',['green'],[['yellow','purple'],'green','brown'], None]})

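def fix_list(text):
    try:
        if '[' in text:
            # A string that looks like a list, e.g. "['green']": parse it.
            text = ast.literal_eval(text)
        else:
            # A plain string or an actual list: wrap it for flatten.
            text = [text]
    except (TypeError, ValueError, SyntaxError):
        # None (or an unparsable string) becomes an empty list.
        text = []
    return list(pd.core.common.flatten(text))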
    
df['category'] = df['category'].apply(fix_list)
df['green'] = df['category'].apply(lambda x: 'green' in x)

Result:

      user                                category  green
0      Bob                        ['green', 'red']   True
1     Jane                                ['blue']  False
2  Theresa                               ['green']   True
3    Alice  ['yellow', 'purple', 'green', 'brown']   True
4     John                                      []  False
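One caveat worth hedging: pd.core.common sits under pandas' internal namespace, so the helper may move or change between versions. If that is a concern, a small generator can do the same job. The sketch below uses a hypothetical helper name, iter_flat, and assumes the column holds only strings, (nested) lists and None.

```python
def iter_flat(value):
    """Yield the leaf items of arbitrarily nested lists (hypothetical helper)."""
    if value is None:
        return  # a missing value yields nothing
    if isinstance(value, list):
        for item in value:
            yield from iter_flat(item)
    else:
        yield value  # a plain string is kept whole

# The category values from the dataframe above, including John's None.
rows = [[["green"], ["red"]], "blue", ["green"],
        [["yellow", "purple"], "green", "brown"], None]

flat_rows = [list(iter_flat(r)) for r in rows]
has_green = ["green" in r for r in flat_rows]
print(has_green)  # [True, False, True, True, False]
```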
