
I have two dataframes, A and B. A is smaller, with 500 rows, and B is larger, with 20,000 rows. A's columns are:

A.columns = ['name','company','model','family']

and B's columns are:

B.columns = ["title", "price"]

The title column in B is a long, messy string, but it contains substrings from three columns of A, namely company, model, and family (ignore the 'name' column, since name in A is itself a combination of company, model, and family). I need to match each row of A to a single row in B. This is my solution:

out = pd.DataFrame(columns=['name', 'company', 'model', 'family', 'title', 'price'])

for index, row in A.iterrows():
    lst = [row['family'], row['model'], row['company']]
    for i, r in B.iterrows():
        if all(w in r['title'] for w in lst):
            out.loc[index, 'name'] = row['name']
            out.loc[index, 'company'] = row['company']
            out.loc[index, 'model'] = row['model']
            out.loc[index, 'family'] = row['family']

            out.loc[index, 'title'] = r['title']
            out.loc[index, 'price'] = r['price']
            break

This does the job, but very inefficiently, and it takes a long time. I know this is a "record linkage" problem, and people research its accuracy and speed, but is there a faster, more efficient way of doing this in pandas? If I check only one or two items from lst against the title it will be faster, but my concern is that it will decrease the accuracy...

In terms of accuracy, I'd rather get fewer matches than wrong matches.
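A minimal sketch of the kind of speed-up I have in mind, on toy stand-ins for my real files (the frames and values below are made up): normalise B.title once outside the loop and vectorise the inner scan with str.contains, so the all-three-terms criterion stays the same but the per-row Python loop over B disappears.

```python
import re
import pandas as pd

# toy stand-ins for the real A and B
A = pd.DataFrame({'company': ['Acme'], 'model': ['X100'], 'family': ['Pro']})
B = pd.DataFrame({'title': ['Acme Pro X100 digital camera', 'unrelated item'],
                  'price': ['99.99', '5.00']})

titles = B['title'].str.lower()  # normalise once, outside the loop

matches = {}
for idx, row in A.iterrows():
    mask = pd.Series(True, index=B.index)
    # still require all three terms; re.escape guards against regex metacharacters
    for term in (row['company'], row['model'], row['family']):
        mask &= titles.str.contains(re.escape(str(term).lower()))
    hits = B.index[mask]
    if len(hits):
        matches[idx] = hits[0]  # keep only the first match, like the break above
```

This keeps the same accuracy criterion while replacing the inner iterrows loop with one vectorised pass per A row.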

Also, A.dtypes and B.dtypes show that both dataframes' columns are objects:

title           object
price           object
dtype: object
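Incidentally, since price comes in as object, it presumably needs a numeric conversion at some point; a sketch with a made-up value:

```python
import pandas as pd

# toy frame; the real B has the same object-dtype price column
B = pd.DataFrame({'title': ['some product'], 'price': ['19.99']})
B['price'] = pd.to_numeric(B['price'], errors='coerce')  # unparseable values become NaN
```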

I appreciate any comments. Thanks.

*********UPDATE***********

Part of the two files can be found at: A B

I do some cleaning on them as:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.colors as mcol
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import math

A = pd.read_csv('A.txt', delimiter=',', header=None) 
A.columns = ['product name','manufacturer','model','family','announced date']

for index, row in A.iterrows():    
    A.loc[index, "product name"] = A.loc[index, "product name"].split('"')[3]
    A.loc[index, "manufacturer"] = A.loc[index, "manufacturer"].split('"')[1]
    A.loc[index, "model"] = A.loc[index, "model"].split('"')[1]
    if 'family' in A.loc[index, "family"]:
        A.loc[index, "family"] = A.loc[index, "family"].split('"')[1]
    if 'announced' in A.loc[index, "family"]:
        A.loc[index, "announced date"] = A.loc[index, "family"]
        A.loc[index, "family"] = ''
    A.loc[index, "announced date"] = A.loc[index, "announced date"].split('"')[1]

A = A.reset_index(drop=True)

B = pd.read_csv('B.txt', error_bad_lines=False, warn_bad_lines=False, header=None) 

B.columns = ["title", "manufacturer", "currency", "price"]
pd.options.display.max_colwidth=200

for index, row in B.iterrows():
    B.loc[index,'manufacturer']=B.loc[index,'manufacturer'].split('"')[1]
    B.loc[index,'currency']=B.loc[index,'currency'].split('"')[1]
    B.loc[index,'price']=B.loc[index,'price'].split('"')[1]
    B.loc[index,'title']=B.loc[index,'title'].split('"')[3]

Then I apply Andrew's approach as suggested in the answer:

def match_strs(row):
    return np.where(B.title.str.contains(row['manufacturer']) & \
                    B.title.str.contains(row['family']) & \
                    B.title.str.contains(row['model']))[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')

(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())

and like I said, some complications happen that I can't figure out. Many thanks for your help

1 Answer


Here's some sample data to work with:

import numpy as np
import pandas as pd

# make A df
manufacturer = ['A','B','C']
model = ['foo','bar','baz']
family = ['X','Y','Z']
name = ['{}_{}_{}'.format(manufacturer[i],model[i],family[i]) for i in range(len(manufacturer))]
A = pd.DataFrame({'name':name,'manufacturer': manufacturer,'model':model,'family':family})

# A
  manufacturer family model     name
0            A      X   foo  A_foo_X
1            B      Y   bar  B_bar_Y
2            C      Z   baz  C_baz_Z

# make B df
title = ['blahblahblah']
title.extend( ['{}_{}'.format(n, 'blahblahblah') for n in name] )
B = pd.DataFrame({'title':title,'price':np.random.randint(1,100,4)})

# B
   price                 title
0     62          blahblahblah
1      7  A_foo_X_blahblahblah
2     92  B_bar_Y_blahblahblah
3     24  C_baz_Z_blahblahblah

We can make a function that matches row indices in A and B, based on your matching criteria, and store them in a new column:

def match_strs(row):
    match_result = np.where(B.title.str.contains(row['manufacturer']) &
                            B.title.str.contains(row['family']) &
                            B.title.str.contains(row['model']))
    if not len(match_result[0]):
        return None
    return match_result[0][0]

A['merge_idx'] = A.apply(match_strs, axis='columns')

Then merge A and B:

(A.merge(B, left_on='merge_idx', right_index=True, how='right')
  .drop(columns='merge_idx')
  .dropna())

Output:

  manufacturer family model     name  price                 title
0            A      X   foo  A_foo_X      7  A_foo_X_blahblahblah
1            B      Y   bar  B_bar_Y     92  B_bar_Y_blahblahblah
2            C      Z   baz  C_baz_Z     24  C_baz_Z_blahblahblah

Is that what you're looking for?

Note that if you want to keep rows in B, even with no match in A, just remove the .dropna() at the end of the merge.
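As a side note, merge's indicator=True flag marks where each row came from, which makes it easy to audit which B rows went unmatched before deciding whether to drop them (toy data below, mirroring the sample frames above):

```python
import pandas as pd

# miniature versions of A (with a precomputed merge_idx) and B
A = pd.DataFrame({'name': ['A_foo_X'], 'merge_idx': [1]})
B = pd.DataFrame({'title': ['blah', 'A_foo_X_blah'], 'price': [5, 7]})

merged = A.merge(B, left_on='merge_idx', right_index=True,
                 how='right', indicator=True)

# rows tagged 'right_only' exist in B but found no match in A
unmatched = merged[merged['_merge'] == 'right_only']
```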


5 Comments

Thank you so much. If I use [0][0] as you did, I get this error: IndexError: ('index 0 is out of bounds for axis 0 with size 0', 'occurred at index 0'). If I do not use [0][0], my merge_idx looks like this: merge_idx ([],) ([],) ([],) ([],) ([6327, 6328, 6343, 7106],) ([497, 3195, 3196, 3197, 8966, 11324],) ([],) ([],) ... and so on. I think it means that for some entries there have been multiple matches? And then once I try your merge method, I get: TypeError: unhashable type: 'numpy.ndarray'
Can you confirm that the code works on the example data I provided? If that works, then there's something about your actual data that is different than the example data I created. It would be helpful if you can update your post with some of the actual data you're using, or with code that will generate analogous data to your real use case, similar to what I've tried to do in my answer.
Your code works fine with the example data. Sure. Please see the update and my complete code. Didn't know where to upload so it's on a remote server. I really appreciate your help.
@user3709260 the data you provided was helpful. Missing fields in A caused errors to occur in match_strs(). I've updated the function now and tested it successfully on the data you provided. Does this work for you now?
@user3709260 Glad to help, if this answer solved your problem please mark it as accepted by clicking the check mark next to the answer. See: How does accepting an answer work? for more information. If you found my answer particularly useful you can indicate this by upvoting.
