1

I have two CSV files from different sources with data from each county in the United States. I can merge the files into a single data frame with the common key for county name, but the two files use a different string format. For example, one file -- "df1" -- has "San Francisco" while the other file -- "df2" -- has "San Francisco County." I need to compare the two columns, identify the matches, and then create an identical key for the merge. For example, if df1 contains "San Francisco," I need to check all of the rows in df2, identify that "San Francisco County" contains "San Francisco," and then create a new key column in df2 that has just "San Francisco."

Here is what I tried.

I would merge the data frames with an inner join. The two CSV files do not have the same number of rows. I believe that the operation, below, will return only rows that have a match, which is what I want.

merged_df = pd.merge(left=df1, right=df2, left_on="COUNTY", right_on="COUNTY")

I am having trouble creating the common key. Here is what I have:

# Create an empty container to store the data
COUNTY_KEY = []

for row in df2['COUNTY']:
    if df1['COUNTY'] in df2['COUNTY']:
        COUNTY_KEY.APPEND(df1(['COUNTY']))
    else:
        COUNTY_KEY.append('0')

# Create the key by adding the new data to df2
df2['COUNTY_KEY'] = COUNTY_KEY

I believe the problem is that I'm trying to compare two objects -- the county variables in the two data frames -- but I need to get inside the objects and compare them at the string level.

I have read a lot of questions and answers about merging data frames, but I haven't found anyone ask this question -- how can we merge two data frames when a common key exists but is not an exact match?

Thank you for your help!

1 Answer 1

1

This should work,

df = pd.DataFrame({'id':[1,2,3,4,5],'county':['San Francisco County','San Jose County','Miami','Syracuse','Buffalo']})

df1 = pd.DataFrame({'id':[9,9,8,7,6],'county':['San Francisco','San Jos
e','Miami', 'Syracuse','Buffalo']})

def matched(target, val):
    target_vals = target['county'].values.tolist()
    for county in target_vals:
        if val in county:
            return county

df1['county_key'] = df1['county'].apply(lambda x: matched(df, x))

pd.merge(df, df1, how='inner', left_on='county',right_on='county_key')

           county_x       id_x       county_y  id_y            county_key
0  San Francisco County     1  San Francisco     9  San Francisco County
1       San Jose County     2       San Jose     9       San Jose County
2                 Miami     3          Miami     8                 Miami
3              Syracuse     4       Syracuse     7              Syracuse
4               Buffalo     5        Buffalo     6               Buffalo

I didn't format the final data frame but you should understand now how to accomplish your task.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.