I've looked but seem to be coming up dry for an answer to the following question.

I have a pandas dataframe analogous to this (call it 'df'):

        Type              Set
    1   theGreen          Z
    2   andGreen          Z           
    3   yellowRed         X
    4   roadRed           Y

I want to add another column to the dataframe (or generate a series) with the same number of rows as the dataframe, holding a numeric code: 1 if Type contains the string "Green", 0 otherwise.

Essentially, I'm trying to find a way of doing this:

   df['color'] = np.where(df['Type'] == 'Green', 1, 0)

Except instead of the usual numpy operators (<,>,==,!=, etc.) I need a way of saying "in" or "contains". Is this possible? Any and all help appreciated!

1 Answer

Use str.contains:

df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0
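
As a side note, when the column has no missing values, the boolean result of str.contains can also be cast to integers directly, without np.where. This is only a sketch of that variant; the frame is rebuilt from the question's sample data, and regex=False is my own addition so that 'Green' is matched as a plain substring.

```python
import pandas as pd

# Sample frame rebuilt from the question (index 1..4).
df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", "roadRed"],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

# str.contains gives a boolean Series; astype(int) maps True/False to 1/0.
# regex=False treats 'Green' as a literal substring, not a regex pattern.
df["color"] = df["Type"].str.contains("Green", regex=False).astype(int)
print(df)
```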

Another solution with apply:

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0

The apply solution is faster (see the timings below), but it fails if column Type contains NaN, raising:

TypeError: argument of type 'float' is not iterable
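
If the column may contain NaN, one option is to stay with str.contains and pass na=False, so that missing values count as "no match". A minimal sketch, assuming that coding NaN as 0 is acceptable; the frame is the question's sample with the last Type replaced by NaN for illustration.

```python
import numpy as np
import pandas as pd

# Question's sample data, with the last Type set to NaN for illustration.
df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", np.nan],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

# na=False makes missing values evaluate to False instead of propagating.
mask = df["Type"].str.contains("Green", regex=False, na=False)
df["color"] = np.where(mask, 1, 0)
print(df)
```

With na=False the mask is purely boolean, so the NaN row simply ends up as 0.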

Timings:

#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)  

In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop

In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop

4 Comments

Could you write a function that handles NaN and apply it instead of the lambda?
@wwii I'm only on my phone right now. I'll add a solution tomorrow.
@wwii - it is more complicated. This works for me: df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x if pd.notnull(x) else False), 1, np.where(df['Type'].isnull(), np.nan, 0)) with df = pd.DataFrame({ 'Set': {1: 'Z', 2: 'Z', 3: 'X', 4: 'Y'}, 'Type': {1: 'theGreen', 2: 'andGreen', 3: 'yellowRed', 4: np.nan}}, columns= ['Type','Set'])
I was thinking more like - def is_green(thing): try: return 'Green' in thing; except (ValueError, TypeError) as e: return False - then, df['color'] = np.where(df['Type'].apply(is_green), 1, 0)
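
Written out as runnable code, the helper suggested in the last comment might look like the sketch below. The name is_green comes from that comment; catching only TypeError and the sample frame (with one NaN) are my assumptions.

```python
import numpy as np
import pandas as pd

def is_green(thing):
    """Return True if 'Green' occurs in thing; False for non-strings such as NaN."""
    try:
        return "Green" in thing
    except TypeError:
        # 'in' raises TypeError for floats, which is how NaN shows up here.
        return False

# Sample data from the comments above, with one missing Type.
df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", np.nan],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

df["color"] = np.where(df["Type"].apply(is_green), 1, 0)
print(df)
```

Unlike the nested np.where version in the earlier comment, this codes NaN rows as 0 rather than keeping them as NaN.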
