I have two dataframes, A and B. A is smaller, with 500 rows, and B is larger, with 20,000 rows. A's columns are:
A.columns = ['name','company','model','family']
and B's columns are:
B.columns = ["title", "price"]
The title column in B is a long, messy string, but it does contain the values from three of A's columns, namely company, model and family (ignore the 'name' column, since name in A is itself a combination of company, model and family). I need to match each row of A to a single row of B. This is my solution:
out = pd.DataFrame(columns=["name", "company", "model", "family", "title", "price"])
for index, row in A.iterrows():
    lst = [row['family'], row['model'], row['company']]
    for i, r in B.iterrows():
        if all(w in r['title'] for w in lst):
            out.loc[index, 'name'] = row['name']
            out.loc[index, 'company'] = row['company']
            out.loc[index, 'model'] = row['model']
            out.loc[index, 'family'] = row['family']
            out.loc[index, 'title'] = r['title']
            out.loc[index, 'price'] = r['price']
            break
This does the job, but very inefficiently, and it takes a long time. I know this is a "record linkage" problem and people research its accuracy and speed, but is there a faster, more efficient way of doing this in Pandas? If I check only one or two items from lst against the title it runs faster, but my concern is that accuracy will suffer...
In terms of accuracy, I'd rather get less matches than wrong matches.
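For context, here is a minimal sketch of how the same triple-substring check could be vectorized over B per row of A, keeping the "no match is better than a wrong match" behavior by dropping unmatched rows. The tiny A and B below are made-up stand-ins for illustration, and regex=False is assumed so model strings with special characters are treated literally:

```python
import pandas as pd

# Hypothetical miniature versions of A and B, just for illustration
A = pd.DataFrame({'name': ['Canon PowerShot SD980'],
                  'company': ['Canon'],
                  'model': ['SD980'],
                  'family': ['PowerShot']})
B = pd.DataFrame({'title': ['Canon PowerShot SD980 IS 12.1 MP Digital Camera'],
                  'price': ['249.99']})

def first_match_idx(row):
    # One boolean mask over all of B's titles; regex=False avoids
    # interpreting model strings like "A+" as regular expressions.
    mask = (B['title'].str.contains(row['company'], regex=False)
            & B['title'].str.contains(row['model'], regex=False)
            & B['title'].str.contains(row['family'], regex=False))
    hits = mask[mask].index
    return hits[0] if len(hits) else None  # None marks "no match"

A['b_idx'] = A.apply(first_match_idx, axis=1)
out = A.dropna(subset=['b_idx']).merge(B, left_on='b_idx', right_index=True)
print(out[['name', 'title', 'price']])
```

This still loops over A's 500 rows, but each `str.contains` scan of B is vectorized, which should be much faster than the nested `iterrows` version.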
Also, A.dtypes and B.dtypes show that both dataframes' columns are objects:
title object
price object
dtype: object
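(Incidentally, since price comes through as object, it can be converted to a numeric dtype along these lines; errors='coerce' turns unparseable entries into NaN rather than raising:)

```python
import pandas as pd

# Hypothetical one-row B, just to show the conversion
B = pd.DataFrame({'title': ['Canon PowerShot SD980'], 'price': ['249.99']})
B['price'] = pd.to_numeric(B['price'], errors='coerce')
print(B['price'].dtype)  # float64
```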
I'd appreciate any comments. Thanks!
*********UPDATE***********
Part of the two files can be found at: A B
I do some cleaning on them as follows:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math
A = pd.read_csv('A.txt', delimiter=',', header=None)
A.columns = ['product name', 'manufacturer', 'model', 'family', 'announced date']
for index, row in A.iterrows():
    A.loc[index, "product name"] = A.loc[index, "product name"].split('"')[3]
    A.loc[index, "manufacturer"] = A.loc[index, "manufacturer"].split('"')[1]
    A.loc[index, "model"] = A.loc[index, "model"].split('"')[1]
    if 'family' in A.loc[index, "family"]:
        A.loc[index, "family"] = A.loc[index, "family"].split('"')[1]
    if 'announced' in A.loc[index, "family"]:
        A.loc[index, "announced date"] = A.loc[index, "family"]
        A.loc[index, "family"] = ''
    A.loc[index, "announced date"] = A.loc[index, "announced date"].split('"')[1]
A = A.reset_index(drop=True)
B = pd.read_csv('B.txt', error_bad_lines=False, warn_bad_lines=False, header=None)
B.columns = ["title", "manufacturer", "currency", "price"]
pd.options.display.max_colwidth = 200
for index, row in B.iterrows():
    B.loc[index, 'manufacturer'] = B.loc[index, 'manufacturer'].split('"')[1]
    B.loc[index, 'currency'] = B.loc[index, 'currency'].split('"')[1]
    B.loc[index, 'price'] = B.loc[index, 'price'].split('"')[1]
    B.loc[index, 'title'] = B.loc[index, 'title'].split('"')[3]
then I apply Andrew's approach as suggested in the answer:
def match_strs(row):
    return np.where(B.title.str.contains(row['manufacturer']) &
                    B.title.str.contains(row['family']) &
                    B.title.str.contains(row['model']))[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')
(A.merge(B, left_on='merge_idx', right_on='index', right_index=True, how='right')
 .drop('merge_idx', 1)
 .dropna())
and, like I said, some complications happen that I can't figure out. Many thanks for your help.
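For what it's worth, two failure modes I suspect here (assumptions on my part, since I don't have the traceback in front of me): `np.where(...)[0][0]` raises an IndexError as soon as a row of A matches nothing in B, and rows whose cleaned family is '' match every title, because every string contains the empty string. (Passing both right_on and right_index to merge can also raise an error in recent pandas.) A minimal guarded sketch of the same idea:

```python
import numpy as np
import pandas as pd

def match_strs_safe(row, titles):
    """Return the position of the first title matching all non-empty
    fields of `row`, or -1 if there is no match.

    `titles` is B['title']; empty fields (e.g. a blank family) are
    skipped instead of matching every title.
    """
    mask = pd.Series(True, index=titles.index)
    for field in ('manufacturer', 'family', 'model'):
        value = row[field]
        if value:  # skip empty strings, which would match everything
            mask &= titles.str.contains(value, regex=False)
    idx = np.where(mask)[0]
    return idx[0] if len(idx) else -1  # -1 marks "no match"
```

Rows that come back as -1 can then be filtered out before merging, and merging on the index alone (right_index=True, without right_on) avoids the conflicting-keyword problem.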