Pandas / Python - Merge dataframes where the key is located in 2 sub-strings

Question

I have been asking this question quite few times and it seems that no one can answer it...

I am looking for a loop/fuction or a simple code that can look through 2 columns in different dataframes and output a third column. This example is quite different from a simple merge or a merge where we have one string and one substring... in this example we have 2 substrings to compare and output a third column if one of the key stored in the substring line is present in in the other substring line of the diffrent dataframe.

This is the example:

data = [['Alex','11111111 20'],['Bob','2222222 0000'],['Clarke','33333 999999']]
df = pd.DataFrame(data,columns=['Name','Code'])
df

data = [['Reed','0000 88'],['Ros',np.nan],['Jo','999999 66']]
df1 = pd.DataFrame(data,columns=['SecondName','Code2'])

What i need is to find where part of both codes are the same like 999999 or 0000 and output the SecondName

The expected output:

I have done my reserach and I found a way to locate a substring from a string but not from another substring like in my case.

Could you explain why the first row, Alex NaN, is part of the expected output? — Marcus
– Marcus, Commented Nov 12, 2020 at 10:51
Because my idea is to left merge on that dataframe which would be df, if this makes sense and because 111111111 or 20 is not found in any substing of Code2 — Caiotru
– Caiotru, Commented Nov 12, 2020 at 10:54

Serial Lazer · Accepted Answer · 2020-11-12 10:55:39Z

2

You need to split the codes and concat all possible combinations of merged-results.

Here is the working code:

import pandas as pd
import numpy as np

data = [['Alex','11111111 20'],['Bob','2222222 0000'],['Clarke','33333 999999']]
df = pd.DataFrame(data,columns=['Name','Code'])

data = [['Reed','0000 88'],['Ros',np.nan],['Jo','999999 66']]
df1 = pd.DataFrame(data,columns=['SecondName','Code2'])

df[['c1', 'c2']] = df.Code.str.split(" ", expand=True)
df1[['c1', 'c2']] = df1.Code2.str.split(" ", expand=True)

rdf = pd.DataFrame()
for col1 in ['c1', 'c2']:
    for col2 in ['c1', 'c2']:
        rdf = pd.concat([rdf, df.merge(df1, left_on=[col1], right_on=[col2], how='inner')], axis=0)

rdf = df.merge(rdf[['Name', 'SecondName']], on='Name', how='outer')
print(rdf[['Name', 'SecondName']])

Output:

     Name SecondName
0    Alex        NaN
1     Bob       Reed
2  Clarke         Jo

answered Nov 12, 2020 at 10:55

Serial Lazer

1,6691 gold badge9 silver badges16 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Caiotru Over a year ago

Thank you, the code works, can you please explain why do we concat a new df (rdf) if it is empty? What is the reason behind this?

Collectives™ on Stack Overflow

Pandas / Python - Merge dataframes where the key is located in 2 sub-strings

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related