I am trying to fuzzy merge two dataframes in Python using the code below:

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
#all_data_st = pd.merge(prospectus, filings, on='NamePeriod')
filings['key']=filings.NamePeriod.apply(lambda x : [process.extract(x, prospectus.NamePeriod, limit=1)][0][0][0])
all_data_st = filings.merge(prospectus,left_on='key',right_on='NamePeriod')
all_data_st.to_excel('merged_file_fuzzy.xlsx')

The idea is to fuzzy merge based on two columns of each dataframe, Name and Year. I tried to combine these two in one field (NamePeriod) and then merge on that, but I am getting the following error:

TypeError: expected string or bytes-like object

Any idea how to perform this fuzzy merge? Here is how these columns look in the dataframes:

print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])



                                               Name  ...                  NamePeriod
0                                               NaN  ...                         NaN
1                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2019
2                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2018
3                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2017
4                             NAM TAI PROPERTY INC.  ...  NAM TAI PROPERTY INC. 2016
                                            ...  ...                         ...
15922                   Huitao Technology Co., Ltd.  ...                         NaN
15923                       Leaping Group Co., Ltd.  ...                         NaN
15924                                    PUYI, INC.  ...                         NaN
15925  Puhui Wealth Investment Management Co., Ltd.  ...                         NaN
15926                           Tidal Royalty Corp.  ...                         NaN

[15927 rows x 3 columns]
           prospectus_issuer_name  fyear                        NamePeriod
0                  ALCAN ALUM LTD   1990               ALCAN ALUM LTD 1990
1                  ALCAN ALUM LTD   1991               ALCAN ALUM LTD 1991
2                  ALCAN ALUM LTD   1992               ALCAN ALUM LTD 1992
3               AMOCO CDA PETE CO   1992            AMOCO CDA PETE CO 1992
4               AMOCO CDA PETE CO   1992            AMOCO CDA PETE CO 1992
                          ...    ...                               ...
1798               KOREA GAS CORP   2016               KOREA GAS CORP 2016
1799               KOREA GAS CORP   2016               KOREA GAS CORP 2016
1800          PETROLEOS MEXICANOS   2016          PETROLEOS MEXICANOS 2016
1801          PETROLEOS MEXICANOS   2016          PETROLEOS MEXICANOS 2016
1802  BOC AVIATION PTE LTD GLOBAL   2016  BOC AVIATION PTE LTD GLOBAL 2016

[1803 rows x 3 columns]

Here is the full code I am trying to run:

import pandas as pd
from rapidfuzz import fuzz, process, utils

prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'

prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)

filings.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
prospectus.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([filings, prospectus], ignore_index=True)

df3.dropna(subset = ["name"], inplace=True)
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)

df3.reset_index(inplace=True)

This gives me the error: IndexError: list index out of range

1 Answer

To summarize the problem:

  • There are two DataFrames, each with a column for the name and a column for the year.

  • You would like to merge the two DataFrames and remove all duplicates, where duplicates are rows that have the same year and a very similar name.

I am working with the following two example DataFrames:

import pandas as pd

df1 = pd.DataFrame({
    'Name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
    'Period': [2019, 2019, 2018]})

df2 = pd.DataFrame({
    'prospectus_issuer_name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
    'fyear': [2019, 2019, 1992]})

My approach to this problem would be to start by concatenating the two DataFrames:

df1.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
df2.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([df1, df2], ignore_index=True)
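A quick check of what the concatenation produces, as a minimal sketch using the example frames with their columns already renamed. The `ignore_index=True` argument matters here: it relabels the combined rows 0..n-1, which is what later allows a plain Python list to be addressed by index label. The variable names `kept`, `kept_labels` and `new_labels` are only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'name': ['NAM PROPERTY INC.', 'NAM PROPERTY INC.', 'ALCAN ALUM LTD'],
    'year': [2019, 2019, 2018]})
df2 = pd.DataFrame({
    'name': ['NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD', 'AMOCO CDA PETE CO'],
    'year': [2019, 2019, 1992]})

# Without ignore_index the original labels are kept, so they repeat.
kept = pd.concat([df1, df2])
# With ignore_index=True the rows are relabelled 0..n-1.
df3 = pd.concat([df1, df2], ignore_index=True)

kept_labels = list(kept.index)   # each original label appears twice
new_labels = list(df3.index)     # one unique label per row
```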

Afterwards it is possible to iterate over this new DataFrame and remove all duplicate rows. I am using RapidFuzz here, since it is faster than FuzzyWuzzy (I am the author). The following code builds a list of preprocessed names ahead of time, since each entry may be compared multiple times and the preprocessing accounts for a large share of the runtime. It then iterates over the rows and compares each row only with rows that have a higher index (rows with a lower index have already been compared, since ratio(a,b) == ratio(b,a)) and the same year. Filtering on the year first means the slow string matching algorithm runs far less often. For all rows that have the same year and a very similar name, the first row is kept and the others are deleted. You might have to play around with the score_cutoff and the matching algorithm to see which one fits your needs best.
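As a rough illustration of why the year filter helps, the sketch below only counts comparisons for the six rows of the example frame; it does no string matching at all. The names `all_pairs` and `same_year_pairs` are made up for this illustration.

```python
from collections import Counter
from math import comb

# Years of the six rows in the concatenated example frame.
years = [2019, 2019, 2018, 2019, 2019, 1992]

# Comparing every row with every later row: n choose 2 pairs.
all_pairs = comb(len(years), 2)

# Comparing only rows that share a year: sum of (group size choose 2).
same_year_pairs = sum(comb(size, 2) for size in Counter(years).values())
```

For the example this is 6 comparisons instead of 15; the saving should grow with the number of distinct years in the real data. The deduplication loop itself follows.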

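from rapidfuzz import fuzz, utils

# Preprocess every name once. str() guards against non-string values
# such as NaN, which would otherwise raise a TypeError.
names = [utils.default_process(str(x)) for x in df3['name']]
for i1, row1 in df3.iterrows():
    # Only look at later rows that share the same year.
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        # WRatio returns 0 when the score is below score_cutoff.
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)

df3.reset_index(inplace=True)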
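The errors reported for the real data (`argument 1 must be str, not float` and `list index out of range`) are consistent with missing names. pandas stores a missing value as NaN, which is a float, and dropping those rows leaves gaps in the index, so index labels no longer match positions in `names`. Below is a minimal sketch of one way to prepare the frame before the loop, assuming rows without a name can simply be discarded. The sample data and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with a missing name, mimicking the real data.
df3 = pd.DataFrame({
    'name': [float('nan'), 'NAM TAI PROPERTY INC.', 'ALCAN ALUM LTD'],
    'year': [2019, 2019, 1990]})

# A missing name is NaN, i.e. a float, which string functions reject.
missing_is_float = isinstance(df3.loc[0, 'name'], float)

# Dropping the row leaves labels 1 and 2, but a list built from the
# remaining names only has positions 0 and 1.
dropped = df3.dropna(subset=['name'])
labels_after_drop = list(dropped.index)

# Resetting the index realigns labels with list positions.
clean = dropped.reset_index(drop=True)
names = [str(x) for x in clean['name']]
labels_after_reset = list(clean.index)
```

With the index reset right after `dropna`, `names[i]` and the row labelled `i` refer to the same entry again.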

18 Comments

I get an error, AttributeError: module 'rapidfuzz.utils' has no attribute 'full_process'. How do I get rid of that?
I also get an error: TypeError: argument 1 must be str, not float
Oh sorry, it's default_process. I fixed the code ;) The same function is called full_process in fuzzywuzzy.
The second error means that there are some float values in there. I changed the example to perform a string conversion, so it should work now.
Hi Max, I edited the initial post and pasted my code. It gives me an error, TypeError: argument 1 must be str, not float. Any ideas how to fix it, or am I doing something wrong?