Python - fuzzy string matching - TypeError: expected string or bytes-like object

Question

I am trying to fuzzy merge two dataframes in Python using the code below:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
#all_data_st = pd.merge(prospectus, filings, on='NamePeriod')
filings['key']=filings.NamePeriod.apply(lambda x : [process.extract(x, prospectus.NamePeriod, limit=1)][0][0][0])
all_data_st = filings.merge(prospectus,left_on='key',right_on='NamePeriod')
all_data_st.to_excel('merged_file_fuzzy.xlsx')

The idea is to fuzzy merge based on two columns of each dataframe, Name and Year. I tried to combine these two in one field (NamePeriod) and then merge on that, but I am getting the following error:

TypeError: expected string or bytes-like object

Any idea how to perform this fuzzy merge? Here is how these columns look in the dataframes:

print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])



                                               Name  ...                  NamePeriod
0                                               NaN  ...                         NaN
1                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2019
2                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2018
3                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2017
4                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2016
                                            ...  ...                         ...
15922                   Huitao Technology Co., Ltd.  ...                         NaN
15923                       Leaping Group Co., Ltd.  ...                         NaN
15924                                    PUYI, INC.  ...                         NaN
15925  Puhui Wealth Investment Management Co., Ltd.  ...                         NaN
15926                           Tidal Royalty Corp.  ...                         NaN

[15927 rows x 3 columns]
           prospectus_issuer_name  fyear                        NamePeriod
0                  ALCAN ALUM LTD   1990               ALCAN ALUM LTD 1990
1                  ALCAN ALUM LTD   1991               ALCAN ALUM LTD 1991
2                  ALCAN ALUM LTD   1992               ALCAN ALUM LTD 1992
3               AMOCO CDA PETE CO   1992            AMOCO CDA PETE CO 1992
4               AMOCO CDA PETE CO   1992            AMOCO CDA PETE CO 1992
                          ...    ...                               ...
1798               KOREA GAS CORP   2016               KOREA GAS CORP 2016
1799               KOREA GAS CORP   2016               KOREA GAS CORP 2016
1800          PETROLEOS MEXICANOS   2016          PETROLEOS MEXICANOS 2016
1801          PETROLEOS MEXICANOS   2016          PETROLEOS MEXICANOS 2016
1802  BOC AVIATION PTE LTD GLOBAL   2016  BOC AVIATION PTE LTD GLOBAL 2016

[1803 rows x 3 columns]

Here is the full code I try to run:

import pandas as pd
from rapidfuzz import fuzz, process, utils

prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'

prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)

filings.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
prospectus.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([filings, prospectus], ignore_index=True)

df3.dropna(subset=["name"], inplace=True)
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)

df3.reset_index(inplace=True)

This gives me the error IndexError: list index out of range

maxbachmann · Accepted Answer · 2020-05-20 19:53:14Z


To summarize the problem:

There are two DataFrames that both have a key for the name and the year.
You would like to merge the two DataFrames and remove all duplicate elements, where duplicates are elements that have the same year and a very similar name.

I am working with the following two example DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'Name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
    'Period': [2019, 2019, 2018]})

df2 = pd.DataFrame({
    'prospectus_issuer_name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
    'fyear': [2019, 2019, 1992]})

My approach to this problem would be to start by concatenating the two DataFrames:

df1.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
df2.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([df1, df2], ignore_index=True)
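
As a small illustration of what ignore_index=True does here, the sketch below concatenates the two example frames with and without it. It assumes the columns have already been renamed to name and year; the variable names kept_labels and renumbered are made up for this example. The renumbered labels matter because the loop further down looks up names[i1] by index label.

```python
import pandas as pd

# The example frames, with the columns already renamed to 'name' and 'year'.
df1 = pd.DataFrame({
    'name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
    'year': [2019, 2019, 2018]})
df2 = pd.DataFrame({
    'name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
    'year': [2019, 2019, 1992]})

# Without ignore_index each frame keeps its own labels, so labels repeat.
kept_labels = pd.concat([df1, df2])
# With ignore_index=True the rows are relabelled 0..n-1.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(list(kept_labels.index))  # [0, 1, 2, 0, 1, 2]
print(list(renumbered.index))   # [0, 1, 2, 3, 4, 5]
```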

Afterwards it is possible to iterate over this new DataFrame and remove all duplicate rows. I am using RapidFuzz here, since it is faster than FuzzyWuzzy (I am the author). The following code creates a list of preprocessed names ahead of time, since each entry might be used multiple times and the preprocessing accounts for a large share of the runtime. It then iterates over the rows and compares each row with all rows that have a higher index and the same year (rows with a lower index have already been compared, since ratio(a, b) == ratio(b, a)). Filtering on the year first means the slow string matching algorithm runs far less often. For all rows that have the same year and a very similar name, the first row is kept and the others are deleted. You might have to play around with the score_cutoff and the matching algorithm to see which fits your needs best.

from rapidfuzz import fuzz, utils

# str() guards against non-string entries such as floats
names = [utils.default_process(str(x)) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)

df3.reset_index(inplace=True)

edited May 20, 2020 at 19:53

answered May 20, 2020 at 11:02


18 Comments

adrCoder Over a year ago

I get an error AttributeError: module 'rapidfuzz.utils' has no attribute 'full_process' how do I get rid of that?

adrCoder Over a year ago

and I also get an error TypeError: argument 1 must be str, not float

maxbachmann Over a year ago

Oh sorry it's default_process I fixed the code ;) The same function is called full_process in fuzzywuzzy

maxbachmann Over a year ago

The second error means that there are some float values in there. I changed the example to perform a string conversion. So it should work now

adrCoder Over a year ago

Hi Max, I edited the initial post and pasted my code. It gives me an error TypeError: argument 1 must be str, not float any ideas how to fix it or am I doing something wrong?
