0

I am trying to perform string matching between two pandas dataframe.

df_1:
ID   Text           Text_ID
1    Apple            53
2    Banana           84
3    Torent File      77

df_2: 
ID   File_name      
22   apple_mask.txt
23   melon_banana.txt
24   Torrent.pdf
25   Abc.ppt

Objective: I want to populate the Text_ID against File_name in df_2 if the string in df_1['Text'] matches with df_2['File_name']. If no matches found then populate the df_2[Text_ID] as -1. So the resultant df` looks like

ID   Flie_name           Text_ID
22   apple_mask.txt        53
23   melon_banana.txt      84
24   Torrent.pdf           77          
25   Abc.ppt               -1

I have tried this SO thread, but it is giving a column where File_name wise fuzz score is listed.

I am trying out a non fuzzy way. Please see below the code snippets:

text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
  if str(j) in str(i):
    t_i = df_1.loc[df_1['Text']==i,'Text_ID']
    text_id.append(t_i)
  else:
    t_i = -1
    text_id.append(t_i)
df_2['Text_ID'] = text_id

But I am getting a blank text_id list.

Can anybody provide some clue on this? I am OK to use fuzzywuzzy as well.

2 Answers 2

1

You can get it with the following code:

df2['Text_ID'] = -1    # set -1 by default for all the file names
for _,file_name in df2.iterrows():
    for _, text in df1.iterrows():     
        if text[0].lower() in file_name[0]:  # compare strings
            df2.loc[df2.File_name == file_name[0],'Text_ID'] = text[1] # assaign the Text_ID from df1 in df2
            break

Keep in mind:

  • String comparison: As it is now working, apple and banana are contained in apple_mask.txt and melon_banana.txt, but torrent file is not in torrent.pdf. Consider redefining the string comparison.
  • df.iterrows() returns two values, the index of the row and the values of the row, in this case I have replaced the index by _ since it is not necessary to solve this problem

result:

df2
          File_name  Text_ID
0   apple_mask.text       53
1  melon_banana.txt       84
2       Torrent.pdf       -1
3           Abc.ppt       -1
Sign up to request clarification or add additional context in comments.

1 Comment

if my answer was useful, can you mark it as an accepted answer?
0

You can try following code:

text_ls = df_1['Text'].tolist()
file_ls = df_2['File_name'].tolist()
text_id = []
for i,j in zip(text_ls,file_ls):
      if j.lower().find(i.lower()) == -1:
        t_i = -1
        df_2.loc[df_2['File_name']==j,'Text_ID']=t_i
      else:
        t_i = df_1.loc[df_1['Text']==i,'Text_ID']
        df_2.loc[df_2['File_name']==j,'Text_ID']=t_i

4 Comments

Not quite sure about this line if j.find(i) == -1:. Please note there is no entry in df_2 as -1. Also you seemed to have missed the fact that df_1['Text'] and df_2['File_name'] are not exactly same.
Refer following answer for further clarification. stackoverflow.com/a/27138045/3417134
I have updated the code to convert df_1['Text'] and df_2['File_name'] to lower just for sake to comparison.
Your method might fail, if there are (1) blanks in df_2 (2) repeat of same file name in df_2. I just checked with my actual example and it is performing correct for the 1st match.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.