I'm trying to compare address values for inaccuracies, for example, given multiple records like:
| Reference | Apartment | Address | PostCode |
|---|---|---|---|
| AS097 | NaN | 00 Name Road | BH1 4HB |
| AS097 | Flat 1 Building Name | 00 Name Road | BH1 4HB |
| AS097 | Flat 2 Building Name | 00 Name Rd | BH1 4HB |
| AS097 | Flat 3 Building Name | 00 Name Road | BH1 4HB |
| AS097 | Flat 4 Building Name | 00 Name Road | BH1 4HB |
| AS097 | Flat 5 Building Name | 00 Name Road | BH1 4HX |
| HO056 | NaN | 23 Street Road | XG9 9GX |
I've a dataframe where I store all the "main" addresses by checking if the Apartment column is empty like so:
main_address = df['Apartment'].isnull()
df_st = pd.DataFrame({'Reference':df[main_address].Reference, 'Address':df[main_address].Address, 'PostCode':df[main_address].PostCode})
df_st will look like this:
| Reference | Apartment | Address | PostCode |
|---|---|---|---|
| AS097 | NaN | 00 Name Road | BH1 4HB |
| HO056 | NaN | 23 Street Road | XG9 9GX |
df has over 1K records, but df_st containing the "main" address ends up with approx. 200 records.
I'm trying to create a new DataFrame where I can identify where the records don't match by domparing df against df_st.
THE PROBLEM
I try the below:
# Clean the Reference values
refs_list = df['Reference'].str.split('/').str[0]
df['Reference'] = refs_list
# Create a new column titled issues and flag if the references match
df['issues'] = np.where(df['Reference'] == df_st['Reference'], 'True', 'False')
I want the above for the Address and PostCode unfortunately this does not work since df and df_st don't have the same shape.
I'm struggling to find a way to achieve a comparison between the two DataFrames df against df_st.
I want to compare all matching Reference row values from df against it's matching from df_st and if one of them don't match create a new column title Issues and store the conflicting column there.
MY DESIRED OUTCOME
Given the data above, after comparing df data against df_st results in a new DataFrame, like below
| Reference | Apartment | Address | PostCode | Issues |
|---|---|---|---|---|
| AS097 | NaN | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 1 Building Name | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 2 Building Name | 00 Name Rd | BH1 4HB | Address |
| AS097 | Flat 3 Building Name | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 4 Building Name | 00 Name Road | BH1 4HB | None |
| AS097 | Flat 5 Building Name | 00 Name Road | BH1 4HX | PostCode |
| HO056 | NaN | 23 Street Road | XG9 9GX | None |
Where Address appears as an issue in the column Issues since the address don't match against df_st, same for PostCode since it differs from df_st
IN A NUT SHELL
All I want to to know how to compare matching rows by Reference from a DataFrame against another and compare the other values Address and PostCode.
Hope that makes sense.
df_stto thedfdataframe usingpandas.DataFrame.mergeand then simply compare the rows as you would normally inpandas. $\endgroup$dfcontain the true data? For example, for AS097, how does one tell if the correct address is00 Name Roador00 Name Rd? Similarly forBH1 4HXversusBH1 4HB. Is it correct to assume that we look todffor the correct PostCode or Address? $\endgroup$df_stcontains the correct data to compare against with,Referenceis not a reliable data source, butAddressindf_stis $\endgroup$