TypeError when using FuzzyWuzzy and Pandas for string matching

Question

I'm getting an error while using the FuzzyWuzzy library in Python 3. I'm also using the Pandas library to work with CSV files.

I have the following data in my CSV file:

> BBL          CorporationName               CorporationName1
  1            123 Elm St LLC                123 Elm St LLC
  2            ABC Realty, INC               ABC Realty, INC
  3            123 Elm Street, LLC           123 Elm Street, LLC
  4            ABC Realty Incorporated       ABC Realty Incorporated

The CorporationName and CorporationName1 columns are actually the same. They each contain the names of real estate-related businesses. The names of these businesses appear multiple times in each column, but as you can see, they sometimes appear in slightly different forms.

My goal is to take each string in CorporationName and compare it with all of the strings in CorporationName1. I would then like FuzzyWuzzy to return the 5 most relevant strings from CorporationName1 (i.e. the possible variations of that name). This is just the first step in a massive string matching task I have subjected myself to.

> import pandas as pd
  from fuzzywuzzy import process
  from fuzzywuzzy import fuzz
  import csv

  df = pd.read_csv('yescorp_fuzz.csv')
  test_list = df.CorporationName
  test_list1 = df.CorporationName1


  def ownermatch():
      for i in test_list:
          result = process.extract(i,test_list1, limit=5)
          print(result)


  ownermatch()

This is the traceback error:

Traceback (most recent call last):
  File "C:/Python34/YesCorpFuzzy4_15.py", line 17, in <module>
    ownermatch()
  File "C:/Python34/YesCorpFuzzy4_15.py", line 13, in ownermatch
    result = process.extract(i,test_list1, limit=5)
  File "C:\Python34\lib\site-packages\fuzzywuzzy\process.py", line 103, in extract
    processed = processor(choice)
  File "C:\Python34\lib\site-packages\fuzzywuzzy\utils.py", line 84, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "C:\Python34\lib\site-packages\fuzzywuzzy\string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer
>>>

To be perfectly honest, I'm not sure what's going on here. I couldn't find much on the internet, either.

Any help that you could provide would be greatly appreciated.

Thanks!

Sam · Accepted Answer · 2016-04-19 13:27:21Z

1

I think you're running into a situation where you have a null value or some other non-string data type in one of the DataFrame columns. FuzzyWuzzy expects strings, and when it encounters a NaN or another non-string value, it throws this error. You could get rid of it by filling in the NaNs with the other column's value:

df['CorporationName'] = df['CorporationName'].fillna(df['CorporationName1'])
df['CorporationName1'] = df['CorporationName1'].fillna(df['CorporationName'])

Or converting non-strings:

df.loc[:, 'CorporationName'] = df.CorporationName.astype(str)
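
To see which cells are actually causing trouble before choosing a fix, something along these lines may help. It is only a rough sketch: the small DataFrame is a made-up stand-in for the CSV, and `non_string_rows` and `clean_names` are hypothetical helper names, not part of pandas or FuzzyWuzzy.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: one missing cell and one numeric cell.
df = pd.DataFrame({
    "CorporationName": ["123 Elm St LLC", float("nan"), "ABC Realty, INC"],
    "CorporationName1": ["123 Elm Street, LLC", "ABC Realty Incorporated", 42],
})


def non_string_rows(frame, column):
    """Return the rows whose value in `column` is not a Python str."""
    is_text = frame[column].map(lambda value: isinstance(value, str)).astype(bool)
    return frame.loc[~is_text, [column]]


def clean_names(frame, column):
    """Drop missing cells, then cast whatever is left to str."""
    return frame[column].dropna().astype(str)


# Inspect the offending cells in each column.
print(non_string_rows(df, "CorporationName"))
print(non_string_rows(df, "CorporationName1"))

# Cleaned columns that contain only strings.
names = clean_names(df, "CorporationName")
choices = clean_names(df, "CorporationName1")
```

The cleaned `names` and `choices` can then go through the same `process.extract(i, choices, limit=5)` loop as in the question. Dropping missing cells before the cast is a deliberate choice here: depending on the pandas version, `astype(str)` on its own may turn a missing cell into the literal text 'nan', which would then show up as a match candidate.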

edited Apr 19, 2016 at 13:27

answered Apr 15, 2016 at 21:35

Sam


7 Comments

Steven Over a year ago

Hi @Sam, strangely it is still returning the same error. Apparently it is something else.

Sam Over a year ago

@Steven could you possibly have other datatypes in your df?

Steven Over a year ago

that seemed to be it. Thanks!

Sam Over a year ago

Cool glad to help. If ya wanna accept my answer or at least gimme an upvote here I'd appreciate it ;)

Steven Over a year ago

absolutely, sorry I forgot! Any reason you use df.loc for converting the strings?
