I'm trying to figure out if there's a way to do fuzzy merges of strings in Pandas based on the difflib SequenceMatcher ratio. Basically, I have two dataframes that look like this:

df_a
company    address        merged
Apple     PO Box 3435       1

df_b
company     address
Apple Inc   PO Box 343

And I want to merge like this:

df_c = pd.merge(df_a, df_b, how='left', on=(difflib.SequenceMatcher(None, df_a['company'], df_b['company']).ratio() > .6) and (difflib.SequenceMatcher(None, df_a['address'], df_b['address']).ratio() > .6))

There are a few posts that are close to what I'm looking for, but none of them work with what I want to do. Any suggestions on how to do this kind of fuzzy merge using difflib?

1 Answer

Something that might work: test every combination of column values for a partial match. Where there is a match, assign a key to df_b to merge on.

import difflib
import pandas

df_a['merge_comp'] = df_a['company']  # we will use these as the merge keys
df_a['merge_addr'] = df_a['address']

# compare every row of df_a against every row of df_b
for comp_a, addr_a in df_a[['company', 'address']].values:
    for ixb, (comp_b, addr_b) in zip(df_b.index, df_b[['company', 'address']].values):
        if difflib.SequenceMatcher(None, comp_a, comp_b).ratio() > .6:
            df_b.loc[ixb, 'merge_comp'] = comp_a  # creates a merge key in df_b
        if difflib.SequenceMatcher(None, addr_a, addr_b).ratio() > .6:
            df_b.loc[ixb, 'merge_addr'] = addr_a  # creates a merge key in df_b
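
As a quick sanity check of the 0.6 threshold, the ratios for the sample values from the question can be computed on their own. This is only a standalone sketch; the helper name `ratio` is made up for illustration and is not part of difflib.

```python
import difflib

def ratio(a, b):
    """Similarity of two strings in [0, 1], as reported by SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Sample values taken from the question's df_a and df_b
company_ratio = ratio("Apple", "Apple Inc")
address_ratio = ratio("PO Box 3435", "PO Box 343")

print(round(company_ratio, 3))  # 0.714
print(round(address_ratio, 3))  # 0.952
```

Both values are above .6, so both merge keys get assigned for this pair of rows.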

Now you can merge

merged_df = pandas.merge(df_a,df_b,on=['merge_addr','merge_comp'],how='inner')
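
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea using the sample rows from the question. A few things in it are assumptions and not from the original post: the helper `similar`, the second row of df_b (made-up filler so that there is a non-matching row), and the `suffixes` passed to the merge. It also collects the keys in plain lists and assigns them as whole columns, so the key columns exist even when nothing matches.

```python
import difflib
import pandas as pd

def similar(a, b, threshold=0.6):
    """True when the SequenceMatcher ratio of a and b exceeds the threshold."""
    return difflib.SequenceMatcher(None, a, b).ratio() > threshold

df_a = pd.DataFrame({
    "company": ["Apple"],
    "address": ["PO Box 3435"],
    "merged": [1],
})
df_b = pd.DataFrame({
    "company": ["Apple Inc", "Zenith Ltd"],        # second row is made-up filler
    "address": ["PO Box 343", "77 High Street"],
})

# On the left side the merge keys are just the original values.
df_a["merge_comp"] = df_a["company"]
df_a["merge_addr"] = df_a["address"]

# For each df_b row, record the df_a value it fuzzily matches (None if no match).
comp_keys, addr_keys = [], []
for comp_b, addr_b in zip(df_b["company"], df_b["address"]):
    comp_key, addr_key = None, None
    for comp_a, addr_a in zip(df_a["company"], df_a["address"]):
        if similar(comp_a, comp_b):
            comp_key = comp_a
        if similar(addr_a, addr_b):
            addr_key = addr_a
    comp_keys.append(comp_key)
    addr_keys.append(addr_key)
df_b["merge_comp"] = comp_keys
df_b["merge_addr"] = addr_keys

# Exact merge on the fuzzy-derived keys; suffixes keep both original spellings.
merged_df = pd.merge(df_a, df_b, on=["merge_comp", "merge_addr"],
                     how="inner", suffixes=("_a", "_b"))
print(merged_df)
```

Note that this compares every row of df_a with every row of df_b, so the cost grows with the product of the two lengths, and if several df_a rows match the same df_b row, the last match wins.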