Conditional data selection with text string data in pandas dataframe

Question

I've looked but seem to be coming up dry for an answer to the following question.

I have a pandas dataframe analogous to this (call it 'df'):

        Type              Set
    1   theGreen          Z
    2   andGreen          Z           
    3   yellowRed         X
    4   roadRed           Y

I want to add another column to the dataframe (or generate a series) of the same length as the dataframe (= equal number of records/rows) which assigns a numerical coding variable (1) if the Type contains the string "Green", (0) otherwise.

Essentially, I'm trying to find a way of doing this:

   df['color'] = np.where(df['Type'] == 'Green', 1, 0)

Except instead of the usual numpy operators (<,>,==,!=, etc.) I need a way of saying "in" or "contains". Is this possible? Any and all help appreciated!

jezrael · Accepted Answer · 2016-11-15 15:49:02Z

7

Use str.contains:

df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0
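
For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt from the question's sample data. The `regex=False` argument and the `color_alt` column are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (index 1..4).
df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", "roadRed"],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

# str.contains returns a boolean Series; regex=False treats "Green" as a
# literal substring rather than a regular expression pattern.
mask = df["Type"].str.contains("Green", regex=False)

df["color"] = np.where(mask, 1, 0)   # same idea as the one-liner above
df["color_alt"] = mask.astype(int)   # illustrative alternative without np.where

print(df)
```

Casting the boolean mask with `astype(int)` should give the same 1/0 coding whenever the column has no missing values.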

Another solution with apply:

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0

The second solution is faster in the timings below, but it does not work if the Type column contains NaN. In that case it raises an error:

TypeError: argument of type 'float' is not iterable
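
One way to sidestep this, sketched here on the assumption that missing values should be coded as 0, is the `na` parameter of `str.contains`. The frame below is a hypothetical variant of the sample data with one missing `Type`; the `lambda_failed` flag exists only to demonstrate the failure.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the sample data with one missing Type value.
df = pd.DataFrame({"Type": ["theGreen", "andGreen", "yellowRed", np.nan],
                   "Set": ["Z", "Z", "X", "Y"]})

# The plain lambda fails on the missing row because NaN is a float.
try:
    df["Type"].apply(lambda x: "Green" in x)
    lambda_failed = False
except TypeError:
    lambda_failed = True

# na=False makes str.contains report False for missing values,
# so the mask is purely boolean and safe to pass to np.where.
mask = df["Type"].str.contains("Green", na=False)
df["color"] = np.where(mask, 1, 0)

print(lambda_failed)
print(df)
```

If missing rows should stay missing rather than become 0, this sketch does not cover that case.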

Timings:

#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)  

In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop

In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop
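
These numbers depend on the machine and on the pandas version, so it may be worth re-running the comparison. A rough sketch using the standard `timeit` module outside IPython follows. The function names and the smaller 40,000-row frame are choices made for this sketch, not taken from the answer.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({"Type": ["theGreen", "andGreen", "yellowRed", "roadRed"],
                     "Set": ["Z", "Z", "X", "Y"]})

# 40,000 rows to keep the run short; the answer above used 400,000.
big = pd.concat([base] * 10_000).reset_index(drop=True)

def with_apply():
    return np.where(big["Type"].apply(lambda x: "Green" in x), 1, 0)

def with_contains():
    return np.where(big["Type"].str.contains("Green"), 1, 0)

# Best of 3 repeats, 5 calls per repeat, converted to seconds per call.
t_apply = min(timeit.repeat(with_apply, number=5, repeat=3)) / 5
t_contains = min(timeit.repeat(with_contains, number=5, repeat=3)) / 5

print(f"apply:        {t_apply * 1e3:.2f} ms per call")
print(f"str.contains: {t_contains * 1e3:.2f} ms per call")
```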

edited Nov 15, 2016 at 15:49

answered Nov 15, 2016 at 15:38

jezrael



4 Comments

wwii Over a year ago

Could you write a function that handles NaN and apply it instead of the lambda?

jezrael Over a year ago

@wwii I'm only on my phone right now. I'll add a solution tomorrow.

jezrael Over a year ago

@wwii - it is more complicated. This works for me:

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x if pd.notnull(x) else False), 1, np.where(df['Type'].isnull(), np.nan, 0))

with

df = pd.DataFrame({ 'Set': {1: 'Z', 2: 'Z', 3: 'X', 4: 'Y'},  'Type': {1: 'theGreen', 2: 'andGreen', 3: 'yellowRed', 4: np.nan}}, columns= ['Type','Set'])

wwii Over a year ago

I was thinking of something more like this:

    def is_green(thing):
        try:
            return 'Green' in thing
        except (ValueError, TypeError) as e:
            return False

and then:

    df['color'] = np.where(df['Type'].apply(is_green), 1, 0)
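
For reference, a runnable sketch of the helper-function idea from this comment thread, applied to the sample frame with a NaN given in the comments. It catches only `TypeError`, on the assumption that this is the only exception the `in` test raises for these values, and it codes missing rows as 0 rather than NaN.

```python
import numpy as np
import pandas as pd

def is_green(thing):
    """Return whether 'Green' is in thing; False if thing does not support `in`."""
    try:
        return "Green" in thing
    except TypeError:  # e.g. NaN is a float, which cannot be searched with `in`
        return False

# Sample frame from the comments, with a missing Type in the last row.
df = pd.DataFrame({"Type": ["theGreen", "andGreen", "yellowRed", np.nan],
                   "Set": ["Z", "Z", "X", "Y"]},
                  index=[1, 2, 3, 4])

df["color"] = np.where(df["Type"].apply(is_green), 1, 0)

print(df)
```

Unlike the nested `np.where` version in the comments, this keeps the column as integers because no NaN is written into the result.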
