I have two DataFrames that I am trying to merge in order to create a choropleth plot. A small subset of each DataFrame is shown below:
DataFrame 1:
COUNTYFP TRACTCE
7 023 960100
8 023 960200
9 023 960300
52 024 960300
5 024 960402
4 031 960403
3 031 960404
6 031 960405
DataFrame 2:
county tract percent
1640 23 960100 16.3562
1643 23 960200 15.6140
1646 23 960300 25.7558
1649 24 960300 40.3279
1652 24 960402 37.9966
1655 31 960403 34.1127
1658 31 960404 26.5466
1661 31 960405 29.2962
What I am trying to do is merge these two DataFrames so that the percent column from DataFrame 2 is appended to DataFrame 1, with each value placed on its corresponding row.
Two things to note, however:
I need to merge on two columns. The tract value 960300 appears under two different counties, so each row must be matched on both the correct county and the correct tract.
The county is stored in a different format in each DataFrame (023 in one, 23 in the other).
The desired output:
COUNTYFP TRACTCE percent
7 023 960100 16.3562
8 023 960200 15.6140
9 023 960300 ...
52 024 960300 ...
5 024 960402 ...
4 031 960403 ...
3 031 960404 ...
6 031 960405 ...
I cannot merge on tract alone because 960300 appears twice. Similarly, I cannot merge on county alone because 23 appears multiple times. Therefore, I need to merge on both columns together, and I am a bit unsure how to do this.
My thoughts are along the lines of:
merged_df = df1.set_index(['COUNTYFP', 'TRACTCE']).join(df2.set_index(['county', 'tract']))
I am not sure whether this will work, though. Is this the correct approach? Also, how do I deal with the different representation of the county value (023 vs 23) across the two DataFrames?
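For reference, here is a minimal sketch of the kind of thing I have in mind, rebuilt from the sample rows above. The dtypes are an assumption on my part: I am treating COUNTYFP and TRACTCE as zero-padded strings and county and tract as integers, which may not match the real data. The `keys` name is just a helper for this example.

```python
import pandas as pd

# Sample rows from the tables above. ASSUMPTION: COUNTYFP/TRACTCE are
# zero-padded strings, while county/tract are integers.
df1 = pd.DataFrame(
    {
        "COUNTYFP": ["023", "023", "023", "024", "024", "031", "031", "031"],
        "TRACTCE": ["960100", "960200", "960300", "960300",
                    "960402", "960403", "960404", "960405"],
    },
    index=[7, 8, 9, 52, 5, 4, 3, 6],
)
df2 = pd.DataFrame(
    {
        "county": [23, 23, 23, 24, 24, 31, 31, 31],
        "tract": [960100, 960200, 960300, 960300,
                  960402, 960403, 960404, 960405],
        "percent": [16.3562, 15.6140, 25.7558, 40.3279,
                    37.9966, 34.1127, 26.5466, 29.2962],
    },
    index=[1640, 1643, 1646, 1649, 1652, 1655, 1658, 1661],
)

# Convert df2's keys to df1's format: 23 -> "023", 960100 -> "960100".
keys = df2.assign(
    COUNTYFP=df2["county"].astype(str).str.zfill(3),
    TRACTCE=df2["tract"].astype(str).str.zfill(6),
)[["COUNTYFP", "TRACTCE", "percent"]]

# Left merge on both columns. validate="many_to_one" raises an error if a
# (county, tract) pair occurs more than once in df2.
merged_df = df1.merge(
    keys, on=["COUNTYFP", "TRACTCE"], how="left", validate="many_to_one"
)

# merge() returns a fresh 0..n-1 index; a validated left merge keeps df1's
# row count and order, so df1's original labels can be put back.
merged_df.index = df1.index

print(merged_df)
```

If the real columns turn out to have different dtypes, the conversion step would need to change accordingly, for example casting COUNTYFP to int instead of padding county.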
Any thoughts, code, or links to examples/docs that you find helpful would be greatly appreciated.
Thanks!