I need to search a dataframe column for any of the strings in a list and write the matching string into a new column of the dataframe. The code below works, but it is horribly inefficient, and my dataframe has millions of rows.

import pandas as pd 
Cars = {'MakeModel': ['HondaCivic','Toyota_Corolla','FordFocus','Audi--A4']}  
df = pd.DataFrame(data=Cars) 

mlist = ['Honda','Toyota','Ford','Audi'] 

# get_value/set_value were removed in pandas 1.0; .at and .loc replace them here.
for i in df.index:
    for x in mlist:
        if x in df.at[i, 'MakeModel']:
            df.loc[i, 'Make'] = x

1 Answer

Let's use str.extract with a capture group here. It extracts the first "make" found in each cell, and yields NaN for any row where none of the strings match.

import re

df['Make'] = df['MakeModel'].str.extract(
    r'({})'.format('|'.join(map(re.escape, mlist))), expand=False)
df
        MakeModel    Make
0      HondaCivic   Honda
1  Toyota_Corolla  Toyota
2       FordFocus    Ford
3        Audi--A4    Audi

map(re.escape, mlist) can be replaced with mlist if you're sure your mlist strings do not contain any regex meta-characters which require escaping.
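To illustrate why the escaping matters, here is a minimal sketch using only the standard re module. The list `makes` and the string `text` are hypothetical values invented for this example; they are not from the question's data.

```python
import re

# Hypothetical list: 'A+' contains the regex meta-character '+'.
makes = ['A+', 'Ford']

# Without escaping, 'A+' is read as "one or more 'A' characters".
unescaped = re.compile('({})'.format('|'.join(makes)))

# With re.escape, 'A+' is matched as the literal two-character string.
escaped = re.compile('({})'.format('|'.join(map(re.escape, makes))))

text = 'A+Roadster'
unescaped_match = unescaped.search(text).group(1)
escaped_match = escaped.search(text).group(1)

print(unescaped_match)
print(escaped_match)
```

The unescaped pattern silently returns a truncated match instead of raising an error, which is why escaping is the safer default when the list comes from outside data.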


2 Comments

Is there a way to use "Unknown" in place of NaNs in this same line of code? This worked perfectly, thanks! It processed 30M rows in less than a minute!
@bikerider Somewhat: you can always call .fillna('Unknown') on the result. extract cannot do it on its own.
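A minimal sketch of chaining .fillna('Unknown') onto the extract call, so it stays a single statement. The 'TeslaModel3' row is a hypothetical addition, included only so that one row has no match.

```python
import re
import pandas as pd

# 'TeslaModel3' is a hypothetical row that matches nothing in mlist.
df = pd.DataFrame({'MakeModel': ['HondaCivic', 'Toyota_Corolla', 'TeslaModel3']})
mlist = ['Honda', 'Toyota', 'Ford', 'Audi']

# Build one alternation pattern with a single capture group.
pattern = '({})'.format('|'.join(map(re.escape, mlist)))

# extract yields NaN where nothing matches; fillna replaces those values.
df['Make'] = df['MakeModel'].str.extract(pattern, expand=False).fillna('Unknown')

print(df)
```

Because expand=False returns a Series for a single capture group, fillna applies directly to the extracted column.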
