I am trying to fuzzy merge two dataframes in Python using the code below:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
#all_data_st = pd.merge(prospectus, filings, on='NamePeriod')
filings['key']=filings.NamePeriod.apply(lambda x : [process.extract(x, prospectus.NamePeriod, limit=1)][0][0][0])
all_data_st = filings.merge(prospectus,left_on='key',right_on='NamePeriod')
all_data_st.to_excel('merged_file_fuzzy.xlsx')
The idea is to fuzzy merge on two columns of each dataframe, Name and Year. I tried combining the two into a single field (NamePeriod) and merging on that, but I get the following error:
TypeError: expected string or bytes-like object
Any idea how to perform this fuzzy merge? Here is how these columns look in the dataframes:
print(filings[['Name', 'Period','NamePeriod']])
print(prospectus[['prospectus_issuer_name', 'fyear','NamePeriod']])
       Name                                          ...  NamePeriod
0      NaN                                           ...  NaN
1      NAM TAI PROPERTY INC.                         ...  NAM TAI PROPERTY INC. 2019
2      NAM TAI PROPERTY INC.                         ...  NAM TAI PROPERTY INC. 2018
3      NAM TAI PROPERTY INC.                         ...  NAM TAI PROPERTY INC. 2017
4      NAM TAI PROPERTY INC.                         ...  NAM TAI PROPERTY INC. 2016
...    ...                                                ...
15922  Huitao Technology Co., Ltd.                   ...  NaN
15923  Leaping Group Co., Ltd.                       ...  NaN
15924  PUYI, INC.                                    ...  NaN
15925  Puhui Wealth Investment Management Co., Ltd.  ...  NaN
15926  Tidal Royalty Corp.                           ...  NaN
[15927 rows x 3 columns]
      prospectus_issuer_name       fyear  NamePeriod
0     ALCAN ALUM LTD               1990   ALCAN ALUM LTD 1990
1     ALCAN ALUM LTD               1991   ALCAN ALUM LTD 1991
2     ALCAN ALUM LTD               1992   ALCAN ALUM LTD 1992
3     AMOCO CDA PETE CO            1992   AMOCO CDA PETE CO 1992
4     AMOCO CDA PETE CO            1992   AMOCO CDA PETE CO 1992
...   ...                                 ...
1798  KOREA GAS CORP               2016   KOREA GAS CORP 2016
1799  KOREA GAS CORP               2016   KOREA GAS CORP 2016
1800  PETROLEOS MEXICANOS          2016   PETROLEOS MEXICANOS 2016
1801  PETROLEOS MEXICANOS          2016   PETROLEOS MEXICANOS 2016
1802  BOC AVIATION PTE LTD GLOBAL  2016   BOC AVIATION PTE LTD GLOBAL 2016
[1803 rows x 3 columns]
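A minimal sketch of the year-restricted match described above, not the code from this question. The output shows NaN values in Name and NamePeriod, and passing a NaN (a float) to a string scorer is a likely source of the TypeError. The sketch skips non-string names and only compares names within the same year. It uses the standard library's difflib as a stand-in for fuzzywuzzy, and the toy rows, the `best_match` helper and the 0.6 cutoff are assumptions for illustration only.

```python
import difflib

import pandas as pd

# Toy stand-ins for the two frames; these rows are hypothetical.
filings = pd.DataFrame({
    "Name": [None, "NAM TAI PROPERTY INC.", "Korea Gas Corporation"],
    "Period": [2019, 2019, 2016],
})
prospectus = pd.DataFrame({
    "prospectus_issuer_name": ["KOREA GAS CORP", "PETROLEOS MEXICANOS"],
    "fyear": [2016, 2016],
})


def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate to `name`, or None if nothing reaches `cutoff`."""
    if not isinstance(name, str):  # NaN/None cannot be scored as text
        return None
    # Compare case-insensitively, but return the candidate as originally spelled.
    lowered = {c.lower(): c for c in candidates if isinstance(c, str)}
    hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


# Group candidate names by year so that names are only compared within a year.
by_year = prospectus.groupby("fyear")["prospectus_issuer_name"].unique().to_dict()

filings["key"] = [
    best_match(name, by_year.get(year, []))
    for name, year in zip(filings["Name"], filings["Period"])
]

# Exact merge on the matched name plus the year; unmatched filings are dropped.
merged = filings.dropna(subset=["key"]).merge(
    prospectus,
    left_on=["key", "Period"],
    right_on=["prospectus_issuer_name", "fyear"],
)
print(merged[["Name", "key", "Period"]])
```

Matching the year exactly and fuzzy-matching only the name avoids scoring the year digits as part of the string. This assumes both year columns hold the same dtype, which would need checking on the real data.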
Here is the full code I am trying to run, this time with rapidfuzz:
import pandas as pd
from rapidfuzz import process, utils
prospectus_data_file = 'file1.xlsx'
filings_data_file = 'file2.xlsx'
prospectus = pd.read_excel(prospectus_data_file)
filings = pd.read_excel(filings_data_file)
filings.rename(columns={'Name': 'name', 'Period': 'year'}, inplace=True)
prospectus.rename(columns={'prospectus_issuer_name': 'name', 'fyear': 'year'}, inplace=True)
df3 = pd.concat([filings, prospectus], ignore_index=True)
from rapidfuzz import fuzz, utils
df3.dropna(subset = ["name"], inplace=True)
names = [utils.default_process(x) for x in df3['name']]
for i1, row1 in df3.iterrows():
    for i2 in df3.loc[(df3['year'] == row1['year']) & (df3.index > i1)].index:
        if fuzz.WRatio(names[i1], names[i2], processor=None, score_cutoff=90):
            df3.drop(i2, inplace=True)
df3.reset_index(inplace=True)
This gives me the error: IndexError: list index out of range
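One likely cause, shown here on hypothetical toy rows rather than the real data: `dropna` removes rows but keeps the original index labels, while `names` is a plain list indexed by position. The labels yielded by `iterrows` can then exceed the length of the list. Lowercasing stands in for `utils.default_process` in this sketch.

```python
import pandas as pd

# Hypothetical rows; the first name is missing, as in the filings output.
df3 = pd.DataFrame({
    "name": [None, "Alpha Inc.", "Beta Ltd."],
    "year": [2019, 2019, 2018],
})
df3.dropna(subset=["name"], inplace=True)

# A positional list built from the remaining rows.
names = [x.lower() for x in df3["name"]]

labels_after_dropna = list(df3.index)
print(labels_after_dropna)  # the labels no longer start at 0
print(len(names))           # the largest label is not a valid list position

# Resetting the index right after dropna makes labels equal list positions again.
df3 = df3.reset_index(drop=True)
labels_after_reset = list(df3.index)
print(labels_after_reset)
```

If this is the cause, calling `reset_index(drop=True)` immediately after `dropna`, before `names` is built, should keep labels and list positions in step.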