I have asked this question quite a few times, and it seems that no one can answer it.

I am looking for a loop, function, or simple piece of code that can look through two columns in different dataframes and output a third column. This is quite different from a simple merge, or from a merge where we have one string and one substring. In this example each column holds a string of space-separated codes, and a third column should be output if one of the codes in one string is also present in the string from the other dataframe.

This is the example:

import pandas as pd
import numpy as np

data = [['Alex','11111111 20'],['Bob','2222222 0000'],['Clarke','33333 999999']]
df = pd.DataFrame(data,columns=['Name','Code'])
df

data = [['Reed','0000 88'],['Ros',np.nan],['Jo','999999 66']]
df1 = pd.DataFrame(data,columns=['SecondName','Code2'])

What I need is to find the rows where part of both codes is the same, like 999999 or 0000, and output the SecondName.

The expected output (posted as an image) keeps every Name from df and adds the matching SecondName, with NaN where no code matches, as for Alex.

I have done my research and found a way to locate a substring within a string, but not within another substring as in my case.

  • Could you explain why the first row, Alex NaN, is part of the expected output? Commented Nov 12, 2020 at 10:51
  • Because my idea is to left merge onto that dataframe, which would be df, if this makes sense, and because neither 11111111 nor 20 is found in any substring of Code2. Commented Nov 12, 2020 at 10:54

1 Answer

You need to split each code string into its two parts, merge the dataframes on every combination of those parts, and concatenate the merge results.

Here is the working code:

import pandas as pd
import numpy as np

data = [['Alex','11111111 20'],['Bob','2222222 0000'],['Clarke','33333 999999']]
df = pd.DataFrame(data,columns=['Name','Code'])

data = [['Reed','0000 88'],['Ros',np.nan],['Jo','999999 66']]
df1 = pd.DataFrame(data,columns=['SecondName','Code2'])

# Split each code string into its two parts (assumes exactly two parts per string)
df[['c1', 'c2']] = df.Code.str.split(" ", expand=True)
df1[['c1', 'c2']] = df1.Code2.str.split(" ", expand=True)

# Start with an empty frame and append the inner merge for each column pairing
rdf = pd.DataFrame()
for col1 in ['c1', 'c2']:
    for col2 in ['c1', 'c2']:
        rdf = pd.concat([rdf, df.merge(df1, left_on=[col1], right_on=[col2], how='inner')], axis=0)

# Merge the matches back onto df so names without a match get NaN
rdf = df.merge(rdf[['Name', 'SecondName']], on='Name', how='outer')
print(rdf[['Name', 'SecondName']])

Output:

     Name SecondName
0    Alex        NaN
1     Bob       Reed
2  Clarke         Jo
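
If a code string can hold more or fewer than two parts, the same idea can be sketched with one row per part instead of fixed c1/c2 columns. This is only an illustrative variant, not the code above: the column name token and the variable names left, right, pairs and result are made up for the example, and it assumes the parts are separated by whitespace.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["Alex", "11111111 20"], ["Bob", "2222222 0000"], ["Clarke", "33333 999999"]],
    columns=["Name", "Code"],
)
df1 = pd.DataFrame(
    [["Reed", "0000 88"], ["Ros", np.nan], ["Jo", "999999 66"]],
    columns=["SecondName", "Code2"],
)

# One row per (Name, token): split on whitespace, then explode the lists.
# "token" is a hypothetical helper column, not part of the original data.
left = df.assign(token=df["Code"].str.split()).explode("token")

# Same for the second frame; drop rows whose Code2 was NaN so they cannot match.
right = (
    df1.assign(token=df1["Code2"].str.split())
    .explode("token")
    .dropna(subset=["token"])
)

# Match on shared tokens, then keep one row per (Name, SecondName) pair.
matches = left.merge(right[["token", "SecondName"]], on="token", how="inner")
pairs = matches[["Name", "SecondName"]].drop_duplicates()

# Left merge back onto df so unmatched names keep a NaN SecondName.
result = df[["Name"]].merge(pairs, on="Name", how="left")
print(result)
```

For the sample data this should give the same three rows as the output above. A name that matches several SecondName values would appear once per match.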
1 Comment

Thank you, the code works. Can you please explain why we concat onto a new df (rdf) when it is empty? What is the reason behind this?
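
As far as I can tell, the empty DataFrame is only a starting value, so that the first pass through the loop has something to concatenate onto. A small sketch of that pattern, using made-up frames (parts, acc and combined are hypothetical names, not from the answer), next to the equivalent form that collects the pieces in a list and concatenates once:

```python
import pandas as pd

# Hypothetical pieces standing in for the per-pairing merge results.
parts = [
    pd.DataFrame({"Name": ["Bob"], "SecondName": ["Reed"]}),
    pd.DataFrame({"Name": ["Clarke"], "SecondName": ["Jo"]}),
]

# Accumulator pattern: start empty, append each piece inside the loop.
# Recent pandas versions may emit a FutureWarning about empty entries here.
acc = pd.DataFrame()
for part in parts:
    acc = pd.concat([acc, part], axis=0)

# Equivalent result without the empty starting frame: concat the list once.
combined = pd.concat(parts, axis=0)

# Both approaches hold the same rows in the same order.
print(acc["Name"].tolist() == combined["Name"].tolist())
print(acc["SecondName"].tolist() == combined["SecondName"].tolist())
```

Collecting into a list and calling pd.concat once avoids copying the growing frame on every iteration, which matters more as the number of pieces grows.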
