Fastest way of searching matching string in two large dataframes using Pandas Python

Question

I have two large dataframes df1 --> 100K rows and df2 --> 600K rows and they look like the following

# df1
                                      name   price    brand  model 
0       CANON CAMERA 20 FS36dINFS MEGAPIXEL  9900.0   CANON  FS36dINFS     
1               SONY HD CAMERA 25 MEGAPIXEL  8900.0    SONY         
2       LG 55" 4K UHD LED Smart TV 55UJ635V  5890.0     LG   55UJ635V       
3      Sony 65" LED Smart TV KD-65XD8505BAE  4790.0    SONY  KD-65XD8505BAE     
4       LG 49" 4K UHD LED Smart TV 49UJ651V  4390.0      LG  49UJ651V     

#df2

                                  name       store      price
0     LG 49" 4K UHD LED Smart TV 49UJ651V     storeA   4790.0
1             SONY HD CAMERA 25 MEGAPIXEL     storeA   12.90
2  Samsung 32" LED Smart TV UE-32J4505XXE     storeB    1.30

I want to match if brand and other features from df1 are in df2 and if they exist then I do something. Currently I am using a naive approach of iterating through both the dataframes like the following

datalist = []
for idx1, row1 in df1.iterrow():
    for idx2, row2 in df2.iterrows():
        if(row1['brand'] in row2['name'] and row1['model'] in row2['name']):
                datalist.append([row1['model'],  row1['brand'], row1['name'],  row1['price'], row2['name'],row2['price'], row2['store']])

But this is taking a lot of time because both dataframes are big. I studied that sets are faster but here, the way I am using the dataframes using iterrows I can't convert to set because then I'll lose the positions. Is there any faster to do that ?

unutbu · Accepted Answer · 2017-08-22 11:07:01Z

2

If there is a lot of repetition in df1['brand'] and df1['model'], then you could possibly improve performance by creating regex patterns for the brands and models:

brands = '({})'.format('|'.join(df1['brand'].dropna().unique()))
# '(CANON|SONY|LG)'
models = '({})'.format('|'.join(df1['model'].dropna().unique()))
# '(FS36dINFS|55UJ635V|KD-65XD8505BAE|49UJ651V)'

Then you could use the str.extract method to find brand and model strings from df2['name']:

df2['brand'] = df2['name'].str.extract(brands, expand=False)
df2['model'] = df2['name'].str.extract(models, expand=False)

Then you could obtain the desired data in the form of a DataFrame by performing an inner-merge:

result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')

import re
import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df1 = pd.DataFrame({'brand': ['CANON', 'SONY', 'LG', 'SONY', 'LG'], 'model': ['FS36dINFS', None, '55UJ635V', 'KD-65XD8505BAE', '49UJ651V'], 'name': ['CANON CAMERA 20 FS36dINFS MEGAPIXEL', 'SONY HD CAMERA 25 MEGAPIXEL', 'LG 55" 4K UHD LED Smart TV 55UJ635V', 'Sony 65" LED Smart TV KD-65XD8505BAE', 'LG 49" 4K UHD LED Smart TV 49UJ651V'], 'price': [9900.0, 8900.0, 5890.0, 4790.0, 4390.0]})

df2 = pd.DataFrame({'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V', 'SONY HD CAMERA 25 MEGAPIXEL', 'Samsung 32" LED Smart TV UE-32J4505XXE'], 'price': [4790.0, 12.9, 1.3], 'store': ['storeA', 'storeA', 'storeB']})

bm = ['brand','model']
for col in bm:
    keywords = [re.escape(item) for item in df1[col].dropna().unique()]
    pat = '({})'.format('|'.join(keywords))
    df2[col] = df2['name'].str.extract(pat, expand=False)
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')
print(result)

yields

  brand     model                               name_x  price_x                               name_y  price_y   store
0    LG  49UJ651V  LG 49" 4K UHD LED Smart TV 49UJ651V   4390.0  LG 49" 4K UHD LED Smart TV 49UJ651V   4790.0  storeA

edited Aug 22, 2017 at 11:07

answered Aug 22, 2017 at 0:15

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

muazfaiz Over a year ago

When I ran the code, it throws sre_constants.error: multiple repeat at position 515858 at line df2[col] = df2['name'].str.extract(pat, expand=False) ... I thought that there is some problem in the data at line 515858 so I removed that line for now but it still continues to throw the same error at the same position

unutbu Over a year ago

There might be some character(s) in the brand or model such as backslashed character which has a special meaning to the regex engine while we intended the character to be taken as a string literal. The solution is to call re.escape on the strings we want interpreted as string literals. I've updated the code to show what I mean.

muazfaiz Over a year ago

Thank you so much! I have spent a lot of time in figuring out this

Collectives™ on Stack Overflow

Fastest way of searching matching string in two large dataframes using Pandas Python

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related