Pandas Merge DataFrame based on Two Columns

Question

I have two DataFrames that I am trying to merge to create a choropleth plot. A small subset of each DataFrame is shown below:

DataFrame 1:

    COUNTYFP   TRACTCE
7   023        960100
8   023        960200
9   023        960300
52  024        960300
5   024        960402
4   031        960403
3   031        960404
6   031        960405

DataFrame 2:

      county    tract     percent
1640    23      960100    16.3562
1643    23      960200    15.6140
1646    23      960300    25.7558
1649    24      960300    40.3279
1652    24      960402    37.9966
1655    31      960403    34.1127
1658    31      960404    26.5466
1661    31      960405    29.2962

What I am trying to do is merge these two DataFrames so that the percent column from DF2 is appended to DF1, with each value placed on its corresponding row.

Two things to note, however:

1. I need to merge on two columns. The tract value 960300 appears under more than one county, so each row must be matched on both the correct county and the correct tract.
2. The county is formatted differently in the two DataFrames (023 in one, 23 in the other).

The desired output:

    COUNTYFP   TRACTCE   percent
7   023        960100    16.3562
8   023        960200    15.6140
9   023        960300    ...
52  024        960300    ...
5   024        960402    ...
4   031        960403    ...
3   031        960404    ...
6   031        960405    ...

I cannot merge on tract alone because 960300 appears twice. Similarly, I cannot merge on county alone because 23 appears multiple times. I therefore need to merge on both columns together, and I am unsure how to do this.

My thoughts are along the lines of:

merged_df = df1.set_index(['COUNTYFP', 'TRACTCE']).join(df2.set_index(['county', 'tract']))

I am not sure if this will work though. Is this the correct approach? Also, how do I deal with the different numerical representation of the county value 023 vs 23 across both dfs?

Any thoughts, code, or links to examples/docs that you find helpful would be greatly appreciated.

Thanks!

Haleemur Ali · Accepted Answer · 2020-11-20 00:55:58Z

Convert df1.COUNTYFP to an integer so that both DataFrames represent the county the same way. The leading zero in 023 suggests that the column has a string type.

df1.COUNTYFP = df1.COUNTYFP.astype('int')
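If the zero-padded form should be kept (the desired output in the question shows 023), one alternative is to pad df2's county into a string key instead of casting df1.COUNTYFP. This is a minimal sketch, assuming COUNTYFP holds three-character strings and TRACTCE holds integers; the sample rows are abbreviated and the county_str column name is only illustrative:

```python
import pandas as pd

# Abbreviated sample data; assumes COUNTYFP is a string and TRACTCE an integer.
df1 = pd.DataFrame({"COUNTYFP": ["023", "024", "031"],
                    "TRACTCE": [960300, 960300, 960403]})
df2 = pd.DataFrame({"county": [23, 24, 31],
                    "tract": [960300, 960300, 960403],
                    "percent": [25.7558, 40.3279, 34.1127]})

# Build a zero-padded string key on df2 (county_str is a hypothetical name).
df2["county_str"] = df2["county"].astype(str).str.zfill(3)

# Merge on the padded key so df1.COUNTYFP keeps its original "023" form.
merged = df1.merge(df2, left_on=["COUNTYFP", "TRACTCE"],
                   right_on=["county_str", "tract"], how="left")
```

Either direction works, as long as both key columns end up with the same dtype before merging.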

Then use df1.merge(df2, ...), passing a list of columns to the left_on and right_on arguments.

df1.merge(df2, left_on=['COUNTYFP', 'TRACTCE'], right_on=['county', 'tract'], how='left')

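# outputs (assuming TRACTCE is stored as integers, like tract):

   COUNTYFP  TRACTCE  county   tract  percent
0        23   960100      23  960100  16.3562
1        23   960200      23  960200  15.6140
2        23   960300      23  960300  25.7558
3        24   960300      24  960300  40.3279
4        24   960402      24  960402  37.9966
5        31   960403      31  960403  34.1127
6        31   960404      31  960404  26.5466
7        31   960405      31  960405  29.2962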
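A merge with left_on/right_on returns the key columns from both inputs and a fresh 0-based index, so some cleanup is needed to reach the layout asked for in the question. The following is a sketch of one way to do that, assuming COUNTYFP is a string column and TRACTCE an integer column; county_int is a hypothetical helper column, and validate="many_to_one" is an optional safeguard that raises if a (county, tract) pair repeats in df2:

```python
import pandas as pd

# Sample data from the question; dtypes are assumptions (COUNTYFP str, TRACTCE int).
df1 = pd.DataFrame(
    {"COUNTYFP": ["023", "023", "023", "024", "024", "031", "031", "031"],
     "TRACTCE": [960100, 960200, 960300, 960300, 960402, 960403, 960404, 960405]},
    index=[7, 8, 9, 52, 5, 4, 3, 6],
)
df2 = pd.DataFrame(
    {"county": [23, 23, 23, 24, 24, 31, 31, 31],
     "tract": [960100, 960200, 960300, 960300, 960402, 960403, 960404, 960405],
     "percent": [16.3562, 15.6140, 25.7558, 40.3279,
                 37.9966, 34.1127, 26.5466, 29.2962]},
    index=[1640, 1643, 1646, 1649, 1652, 1655, 1658, 1661],
)

# Add an integer copy of the county key; COUNTYFP itself stays as "023".
left = df1.assign(county_int=df1["COUNTYFP"].astype(int))

merged = left.merge(
    df2,
    left_on=["county_int", "TRACTCE"],
    right_on=["county", "tract"],
    how="left",
    validate="many_to_one",  # fail loudly if df2 has duplicate key pairs
)

# A many-to-one left merge keeps df1's row order and row count,
# so df1's original index can be restored directly.
merged.index = df1.index

# Keep only the columns from the desired output.
result = merged[["COUNTYFP", "TRACTCE", "percent"]]
```

Working on a copy made with assign leaves df1 unmodified, which helps if the zero-padded COUNTYFP strings are needed later, for example when joining to the geometry for the choropleth.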
