merge 2 dataframes based on partial string-match between columns

Question

I have two data frames df1 and df2 as shown below:

Df1:

                movie correct_id
0             birdman        N/A
1   avengers: endgame        N/A
2            deadpool        N/A
3  once upon deadpool        N/A

Df2 (reference data frame):

                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

Expected Result:

                movie correct_id
0             birdman          4
1   avengers: endgame          2
2            deadpool        N/A
3  once upon deadpool          1

How do I merge two dataframes based on a partial string match?

NB: The movie names are not exactly the same in the two dataframes.

First you'll need to define precisely what you consider a partial string match. And what happened to The King?
– Arne, Commented Jun 4, 2021 at 11:43
I consider df2 as the reference; The King doesn't exist in the reference. I mean that the movie names are not exactly the same, e.g. 'The avengers: endgame' in the reference (df2) is 'avengers: endgame' in df1.
– Learner, Commented Jun 4, 2021 at 11:49
Have a look at fuzzywuzzy or rapidfuzz to compute string distances, and for each key in df1 take the key in df2 that minimizes the Levenshtein distance.
– linog, Commented Jun 4, 2021 at 11:49
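
To make the suggestion in the last comment concrete, here is a minimal pure-Python sketch of picking, for each title, the reference title with the smallest Levenshtein distance. The helper names `levenshtein` and `closest` are made up for illustration; in practice fuzzywuzzy or rapidfuzz would do this faster and with better scorers.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def closest(name: str, candidates: list) -> str:
    """Return the candidate with the smallest case-insensitive edit distance to name."""
    return min(candidates, key=lambda c: levenshtein(name.lower(), c.lower()))


reference = ["birdmans", "The avengers: endgame", "The King", "once upon a deadpool"]
for title in ["birdman", "avengers: endgame", "once upon deadpool"]:
    print(title, "->", closest(title, reference))
```

Note that a raw minimum always returns something, even for a title such as 'deadpool' that has no good counterpart, so some cutoff on the distance is still needed.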

Corralien · Accepted Answer · 2021-06-04 13:05:23Z


From a previous post.

Input data:

>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN

>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

A bit of fuzzy logic:

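import pandas as pd
from fuzzywuzzy import process

# For each title in df1, find the closest title in df2.
# With a Series as choices, extractOne returns (match, score, index label).
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                               .tolist(), columns=["movie", "ratio", "best_id"])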

>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     95        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3

The index of dfm is the index of df1, whereas the column best_id holds the index of the best-matching row in df2. Now you can update your first dataframe:

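THRESHOLD = 90  # adjust this number

# Index of ids = row label in df1; value = row label of the best match in df2
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# Re-index the looked-up ids on df1's labels so the assignment aligns on df1, not df2
df1["correct_id"] = pd.Series(df2.loc[ids, "correct_id"].to_numpy(),
                              index=ids.index).astype("Int64")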

>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
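
For anyone who wants to try the whole flow without installing fuzzywuzzy, here is a rough end-to-end sketch of the same idea. It swaps in `difflib.SequenceMatcher` from the standard library as the scorer, so the ratios are on a 0 to 1 scale and will not equal the fuzzywuzzy scores above. The function names `best_match` and `fuzzy_fill_ids` and the 0.85 threshold are my own assumptions, not part of any library.

```python
from difflib import SequenceMatcher

import pandas as pd


def best_match(name, choices):
    """Return (matched_text, score, index_label) for the closest entry in a Series."""
    scored = [
        (text, SequenceMatcher(None, name.lower(), text.lower()).ratio(), label)
        for label, text in choices.items()
    ]
    return max(scored, key=lambda item: item[1])


def fuzzy_fill_ids(df1, df2, threshold=0.85):
    """Copy correct_id from df2 into df1 wherever the best title match beats threshold."""
    dfm = pd.DataFrame(
        [best_match(name, df2["movie"]) for name in df1["movie"]],
        columns=["movie", "ratio", "best_id"],
        index=df1.index,
    )
    # Index = row label in df1, value = row label of the best match in df2
    ids = dfm.loc[dfm["ratio"] > threshold, "best_id"]
    out = df1.copy()
    # Re-index on df1's labels so rows without a good match become <NA>
    out["correct_id"] = pd.Series(
        df2.loc[ids, "correct_id"].to_numpy(), index=ids.index
    ).astype("Int64")
    return out


df1 = pd.DataFrame({
    "movie": ["birdman", "avengers: endgame", "deadpool", "once upon deadpool"],
    "correct_id": [pd.NA] * 4,
})
df2 = pd.DataFrame({
    "movie": ["birdmans", "The avengers: endgame", "The King", "once upon a deadpool"],
    "correct_id": [4, 2, 3, 1],
})

result = fuzzy_fill_ids(df1, df2)
print(result)
```

The threshold is the part to tune: too low and 'deadpool' gets attached to 'once upon a deadpool', too high and near-identical titles are dropped.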

edited Jun 4, 2021 at 13:05

answered Jun 4, 2021 at 12:52

Corralien


3 Comments

Corralien Over a year ago

No, it's not possible; that was not the correct result. Please check my updated answer.

Learner Over a year ago

Actually the first one works very well for me, but when I tried the updated answer, I got an error: TypeError: object cannot be converted to an IntegerDtype

Corralien Over a year ago

Remove astype("Int64") and see the result. What is your version of Pandas?
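
One possible cause of that TypeError, assuming the reference column is stored with object dtype (for example ids read in as strings, or mixed with 'N/A' text), is that astype("Int64") cannot convert such values directly. A hedged sketch of a workaround is to coerce to numeric first; the sample values below are invented for illustration.

```python
import pandas as pd

# Hypothetical column as it might look after loading from a file: strings plus a placeholder
raw_ids = pd.Series(["4", "2", "N/A", "1"], dtype=object)

# Unparseable entries become NaN, then the nullable integer dtype turns them into <NA>
clean_ids = pd.to_numeric(raw_ids, errors="coerce").astype("Int64")
print(clean_ids)
```

If the column is already numeric this step changes nothing, so it is cheap to apply before the final astype("Int64").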
