I have two data frames df1 and df2 as shown below:

df1:

                  movie    correct_id
0              birdman        N/A
1     avengers: endgame        N/A
2              deadpool        N/A
3  once upon deadpool        N/A

df2 (reference data frame):

                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

Expected Result:

                movie correct_id
0             birdman          4
1   avengers: endgame          2
2            deadpool        N/A
3  once upon deadpool          1

How can I merge two dataframes based on a partial string match?

NB: The movie names are not exactly the same in the two dataframes.

3 Comments

  • First you'll need to define precisely what you consider a partial string match. And what happened to The King? Commented Jun 4, 2021 at 11:43
  • I treat df2 as the reference; The King is in the reference but has no counterpart in df1. What I mean is that the movie names are not exactly the same, e.g. 'The avengers: endgame' in the reference (df2) is 'avengers: endgame' in df1. Commented Jun 4, 2021 at 11:49
  • Have a look at fuzzywuzzy or rapidfuzz to compute a string distance, and for each key in df1 take the key in df2 that minimizes the Levenshtein distance. Commented Jun 4, 2021 at 11:49

1 Answer

From a previous post.

Input data:

>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN

>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

A bit of fuzzy logic:

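import pandas as pd
from fuzzywuzzy import process

# For each title in df1, find the closest title in df2.
# With a Series as choices, extractOne returns (matched title, score, index label in df2).
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                               .tolist(), columns=["movie", "ratio", "best_id"])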
>>> dfm
                            movie  ratio  best_id
0                        birdmans     93        0
1           The avengers: endgame     95        1
2            once upon a deadpool     90        3
3            once upon a deadpool     95        3

The index of dfm is the index of df1, while the column best_id holds the matching index labels of df2. Now you can update your first dataframe:

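THRESHOLD = 90  # adjust this number

# Keep only matches scoring above the threshold (index: df1 row, value: df2 row).
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# Re-index the looked-up ids by df1's rows so the assignment aligns on df1, not on df2.
df1["correct_id"] = pd.Series(df2.loc[ids, "correct_id"].to_numpy(), index=ids.index).astype("Int64")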
>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
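
If installing fuzzywuzzy is not an option, the same idea (best match per title, then a cutoff) can be sketched with the standard library's difflib. This is not the scorer used above: SequenceMatcher.ratio() returns a value between 0 and 1, and partial matches score differently than with fuzzywuzzy. The helper name best_match, the reference dict and the 0.85 cutoff are assumptions chosen for this example data, not part of the answer.

```python
from difflib import SequenceMatcher


def best_match(title, reference, threshold=0.85):
    """Return the id of the most similar reference title, or None if below threshold.

    `reference` maps reference titles to ids; `threshold` is an assumed cutoff
    on SequenceMatcher.ratio(), which ranges from 0.0 to 1.0.
    """
    def score(candidate):
        # Case-insensitive similarity between the query and one reference title.
        return SequenceMatcher(None, title.lower(), candidate.lower()).ratio()

    best = max(reference, key=score)
    return reference[best] if score(best) >= threshold else None


# Reference titles mapped to ids (same data as df2).
reference = {
    "birdmans": 4,
    "The avengers: endgame": 2,
    "The King": 3,
    "once upon a deadpool": 1,
}
titles = ["birdman", "avengers: endgame", "deadpool", "once upon deadpool"]

matched = [best_match(t, reference) for t in titles]
print(matched)  # [4, 2, None, 1]
```

With dataframes, the reference dict could be built with dict(zip(df2["movie"], df2["correct_id"])) and the helper applied through df1["movie"].map(...). The cutoff would need tuning on real data, since short titles such as "deadpool" can score surprisingly high against longer titles that contain them.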

3 Comments

No, that's not possible; that was not the correct result. Please check my updated answer.
Actually, the first version works very well for me, but when I tried the updated answer I got an error: TypeError: object cannot be converted to an IntegerDtype
Remove astype("Int64") and see the result. What is your version of Pandas?
