0

I have two large dataframes df1 --> 100K rows and df2 --> 600K rows and they look like the following

# df1
                                      name   price    brand  model 
0       CANON CAMERA 20 FS36dINFS MEGAPIXEL  9900.0   CANON  FS36dINFS     
1               SONY HD CAMERA 25 MEGAPIXEL  8900.0    SONY         
2       LG 55" 4K UHD LED Smart TV 55UJ635V  5890.0     LG   55UJ635V       
3      Sony 65" LED Smart TV KD-65XD8505BAE  4790.0    SONY  KD-65XD8505BAE     
4       LG 49" 4K UHD LED Smart TV 49UJ651V  4390.0      LG  49UJ651V     

#df2

                                  name       store      price
0     LG 49" 4K UHD LED Smart TV 49UJ651V     storeA   4790.0
1             SONY HD CAMERA 25 MEGAPIXEL     storeA   12.90
2  Samsung 32" LED Smart TV UE-32J4505XXE     storeB    1.30

I want to match if brand and other features from df1 are in df2 and if they exist then I do something. Currently I am using a naive approach of iterating through both the dataframes like the following

datalist = []
for idx1, row1 in df1.iterrow():
    for idx2, row2 in df2.iterrows():
        if(row1['brand'] in row2['name'] and row1['model'] in row2['name']):
                datalist.append([row1['model'],  row1['brand'], row1['name'],  row1['price'], row2['name'],row2['price'], row2['store']])

But this is taking a lot of time because both dataframes are big. I studied that sets are faster but here, the way I am using the dataframes using iterrows I can't convert to set because then I'll lose the positions. Is there any faster to do that ?

1 Answer 1

2

If there is a lot of repetition in df1['brand'] and df1['model'], then you could possibly improve performance by creating regex patterns for the brands and models:

brands = '({})'.format('|'.join(df1['brand'].dropna().unique()))
# '(CANON|SONY|LG)'
models = '({})'.format('|'.join(df1['model'].dropna().unique()))
# '(FS36dINFS|55UJ635V|KD-65XD8505BAE|49UJ651V)'

Then you could use the str.extract method to find brand and model strings from df2['name']:

df2['brand'] = df2['name'].str.extract(brands, expand=False)
df2['model'] = df2['name'].str.extract(models, expand=False)

Then you could obtain the desired data in the form of a DataFrame by performing an inner-merge:

result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')

import re
import sys
import pandas as pd
pd.options.display.width = sys.maxsize

df1 = pd.DataFrame({'brand': ['CANON', 'SONY', 'LG', 'SONY', 'LG'], 'model': ['FS36dINFS', None, '55UJ635V', 'KD-65XD8505BAE', '49UJ651V'], 'name': ['CANON CAMERA 20 FS36dINFS MEGAPIXEL', 'SONY HD CAMERA 25 MEGAPIXEL', 'LG 55" 4K UHD LED Smart TV 55UJ635V', 'Sony 65" LED Smart TV KD-65XD8505BAE', 'LG 49" 4K UHD LED Smart TV 49UJ651V'], 'price': [9900.0, 8900.0, 5890.0, 4790.0, 4390.0]})

df2 = pd.DataFrame({'name': ['LG 49" 4K UHD LED Smart TV 49UJ651V', 'SONY HD CAMERA 25 MEGAPIXEL', 'Samsung 32" LED Smart TV UE-32J4505XXE'], 'price': [4790.0, 12.9, 1.3], 'store': ['storeA', 'storeA', 'storeB']})

bm = ['brand','model']
for col in bm:
    keywords = [re.escape(item) for item in df1[col].dropna().unique()]
    pat = '({})'.format('|'.join(keywords))
    df2[col] = df2['name'].str.extract(pat, expand=False)
result = pd.merge(df1.dropna(subset=bm), df2.dropna(subset=bm), on=bm, how='inner')
print(result)

yields

  brand     model                               name_x  price_x                               name_y  price_y   store
0    LG  49UJ651V  LG 49" 4K UHD LED Smart TV 49UJ651V   4390.0  LG 49" 4K UHD LED Smart TV 49UJ651V   4790.0  storeA
Sign up to request clarification or add additional context in comments.

3 Comments

When I ran the code, it throws sre_constants.error: multiple repeat at position 515858 at line df2[col] = df2['name'].str.extract(pat, expand=False) ... I thought that there is some problem in the data at line 515858 so I removed that line for now but it still continues to throw the same error at the same position
There might be some character(s) in the brand or model such as backslashed character which has a special meaning to the regex engine while we intended the character to be taken as a string literal. The solution is to call re.escape on the strings we want interpreted as string literals. I've updated the code to show what I mean.
Thank you so much! I have spent a lot of time in figuring out this

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.