Background

I have a DataFrame:

import pandas as pd
import nltk
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Text': ['This num dogs and cats is (111)888-8780 and other',
                            'dont block cow 23 here',
                            'cat two num: dog  and cows here']
                   })

I also have a list

 word_list = ['dog', 'cat', 'cow']

and a function that is supposed to do fuzzy matching on the Text column of the df with the word_list

def fuzzy(row, word_list):
    
    tweet = row[0]
    fuzzy_match = []

    for word in word_list:
     
        token_words = nltk.word_tokenize(tweet)
        
        for token in range(0, len(token_words) - 1):
                
            fuzzy_fx = process.extract(word_list[word], token_words[token], limit=100, scorer = fuzz.ratio)
            fuzzy_match.append(fuzzy_fx[0])

    return pd.Series([fuzzy_match], index = ['Fuzzy_Match'])

I then join

df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))

But I get an error

TypeError: expected string or bytes-like object

Desired output

My desired output is a new column, Fuzzy_Match, containing the output of the fuzzy function:

    ID  Text                                                 Fuzzy_Match
0   1   This num dogs and cats is (111)888-8780 and other   output of fuzzy 1
1   2   dont block cow 23 here                              output of fuzzy 2
2   3   cat two num: dog and cows here                      output of fuzzy 3

Question

What do I need to do to get my desired output?

  • Use tweet = row[1]; with row[0] you are currently accessing the ID, not the text. Commented Jun 23, 2021 at 23:18

1 Answer

This should work:

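def fuzzy(row, word_list):
    # row[1] is the Text column by position (row[0] is the integer ID)
    tweet = row[1]
    fuzzy_match = []
    # Tokenize once per row instead of once per word
    token_words = nltk.word_tokenize(tweet)
    for word in word_list:
        # Score this word against every token in the row
        fuzzy_fx = process.extract(word, token_words, limit=100, scorer=fuzz.ratio)
        # Keep only the first (token, score) pair returned for this word
        fuzzy_match.append(fuzzy_fx[0])

    return pd.Series([fuzzy_match], index=['Fuzzy_Match'])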

df_fuzz = df.join(df.apply(lambda x: fuzzy(x, word_list), axis = 1))

process.extract() expects a list of choices as its second argument, so pass the whole list of tokens rather than a single token. You can read more about it in the question "python fuzzywuzzy's process.extract(): how does it work?"
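To make the "query against a list of choices" shape concrete, here is a minimal sketch of that pattern using only the standard library. It is not fuzzywuzzy's implementation: the ratio and extract helpers below are hypothetical stand-ins, with difflib.SequenceMatcher swapped in for fuzz.ratio, so the scores may differ from what fuzzywuzzy reports.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity score from 0 to 100 (hypothetical stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract(query, choices, limit=5):
    """Score query against every item in choices; return (choice, score) pairs, best first."""
    scored = [(choice, ratio(query, choice)) for choice in choices]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


# A list of tokens, as nltk.word_tokenize would produce for one row
tokens = ["dont", "block", "cow", "23", "here"]
matches = extract("cow", tokens)
best = matches[0]  # highest-scoring (token, score) pair

# If a plain string is passed instead of a list, this sketch iterates over
# its characters, so every "choice" is a single letter.
char_matches = extract("cow", "cow")
```

In this sketch, best is the pair for the token "cow", while char_matches contains only single-character choices. That is why the second argument should be the token list and not one token at a time.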
