I'm getting an error while using the FuzzyWuzzy library in Python 3. I'm also using the Pandas library to work with CSV files.

I have the following data in my CSV file:

> BBL          CorporationName               CorporationName2
  1            123 Elm St LLC                123 Elm St LLC    
  2            ABC Realty, INC               ABC Realty, INC     
  3            123 Elm Street, LLC           123 Elm Street, LLC 
  4            ABC Realty Incorporated       ABC Realty Incorporated        

The CorporationName and CorporationName2 columns are actually the same. They each contain the names of real estate-related businesses. The names of these businesses appear multiple times in each column, but as you can see, they sometimes appear in slightly different forms.

My goal is to take each string in CorporationName and compare it with all of the strings in CorporationName2. I would like then for FuzzyWuzzy to return the 5 most relevant strings from CorporationName2 (i.e. the possible variations of that name). This is just the first step in a massive string matching task I have subjected myself to.

> import pandas as pd
  from fuzzywuzzy import process
  from fuzzywuzzy import fuzz 
  import csv

  df = pd.read_csv('yescorp_fuzz.csv')
  test_list = df.CorporationName
  test_list1 = df.CorporationName1


  def ownermatch():
      for i in test_list:
          result = process.extract(i, test_list1, limit=5)
          print(result)


  ownermatch()

This is the traceback error:

Traceback (most recent call last):
  File "C:/Python34/YesCorpFuzzy4_15.py", line 17, in <module>
    ownermatch()
  File "C:/Python34/YesCorpFuzzy4_15.py", line 13, in ownermatch
    result = process.extract(i,test_list1, limit=5)
  File "C:\Python34\lib\site-packages\fuzzywuzzy\process.py", line 103, in extract
    processed = processor(choice)
  File "C:\Python34\lib\site-packages\fuzzywuzzy\utils.py", line 84, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "C:\Python34\lib\site-packages\fuzzywuzzy\string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer
>>> 

To be perfectly honest, I'm not sure what's going on here. I couldn't find much on the internet, either.

Any help that you could provide would be greatly appreciated.

Thanks!

1 Answer

I think you're running into a situation where you have a null value or some other non-string data type in one of the dataframe columns. FuzzyWuzzy expects a string, and when it encounters a NaN or another non-string, it throws the error. You could get rid of this by filling in the NaNs with the other column's value:

df.CorporationName.fillna(df.CorporationName1, inplace = True)
df.CorporationName1.fillna(df.CorporationName, inplace = True)

Or converting non-strings:

df.loc[:, 'CorporationName'] = df.CorporationName.astype(str)
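
To see which values are causing the trouble before changing anything, something like the sketch below might help. It is only an illustration: the sample DataFrame and the helper names `non_string_rows` and `clean_names` are made up here and are not part of pandas or FuzzyWuzzy. Note that `astype(str)` turns a missing value into the literal string "nan", so the sketch drops missing values first.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV; None is a missing value
# and 12345 is a stray non-string entry.
df = pd.DataFrame({
    "CorporationName": ["123 Elm St LLC", None, "ABC Realty, INC"],
    "CorporationName1": ["123 Elm St LLC", "ABC Realty Incorporated", 12345],
})


def non_string_rows(series):
    """Return the entries of a Series that are not Python str objects."""
    mask = series.map(lambda value: not isinstance(value, str))
    return series[mask]


def clean_names(series):
    """Drop missing values, then convert whatever is left to str."""
    return series.dropna().astype(str).tolist()


# Inspect the offending rows in each column.
bad_names = non_string_rows(df["CorporationName"])
bad_names1 = non_string_rows(df["CorporationName1"])

# Plain lists of strings that are safe to hand to a string matcher.
queries = clean_names(df["CorporationName"])
choices = clean_names(df["CorporationName1"])
```

With the columns cleaned this way, the loop from the question could call `process.extract(i, choices, limit=5)` for each `i` in `queries`.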

7 Comments

Hi @Sam, strangely it is still returning the same error. Apparently it is something else.
@Steven could you possibly have other datatypes in your df?
that seemed to be it. Thanks!
Cool glad to help. If ya wanna accept my answer or at least gimme an upvote here I'd appreciate it ;)
absolutely, sorry I forgot! Any reason you use df.loc for converting the strings?