Merge and fill missing values based on multiple columns from another dataframe in Python

Question

I want to merge two dataframes on year and city, filling the missing values in df1's gdp_value and growth_rate columns with the values from gdp and rate, respectively, in df2.

df1

   year city  gdp_value  growth_rate
0  2015   sh        NaN          NaN
1  2016   sh        NaN          NaN
2  2017   sh        NaN          NaN
3  2018   sh        NaN          NaN
4  2019   sh        NaN          NaN
5  2015   bj        7.0         0.01
6  2016   bj        3.0         0.03
7  2017   bj        2.0        -0.03
8  2018   bj        5.0         0.05
9  2019   bj        4.0         0.02

df2

   year city  gdp  rate
0  2015   sh    6  0.04
1  2016   sh    5  0.07
2  2017   sh    3 -0.03
3  2018   sh    6  0.05
4  2019   sh    4  0.02

I tried pd.merge(df1, df2, on=['year', 'city'], how='left') and got:

   year city  gdp_value  growth_rate  gdp  rate
0  2015   sh        NaN          NaN  6.0  0.04
1  2016   sh        NaN          NaN  5.0  0.07
2  2017   sh        NaN          NaN  3.0 -0.03
3  2018   sh        NaN          NaN  6.0  0.05
4  2019   sh        NaN          NaN  4.0  0.02
5  2015   bj        7.0         0.01  NaN   NaN
6  2016   bj        3.0         0.03  NaN   NaN
7  2017   bj        2.0        -0.03  NaN   NaN
8  2018   bj        5.0         0.05  NaN   NaN
9  2019   bj        4.0         0.02  NaN   NaN

My desired output looks like this:

   year city  gdp_value  growth_rate
0  2015   sh          6         0.04
1  2016   sh          5         0.07
2  2017   sh          3        -0.03
3  2018   sh          6         0.05
4  2019   sh          4         0.02
5  2015   bj          7         0.01
6  2016   bj          3         0.03
7  2017   bj          2        -0.03
8  2018   bj          5         0.05
9  2019   bj          4         0.02

Thanks for your help.

Edit: this solution seems to work, thanks:

df1 = df1.set_index(['year', 'city'])
df1.update(
    df2
    .set_index(['year', 'city'])
    .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'})
)
df1 = df1.reset_index()

Scott Boston · Accepted Answer · 2019-10-31 13:03:15Z

2

As mentioned in the question, you can use update, depending on your data and needs:

df1 = df1.set_index(['year', 'city'])
df1.update(
    df2
    .set_index(['year', 'city'])
    .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'})
)
df1 = df1.reset_index()
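For reference, here is a self-contained sketch of the update approach that rebuilds the sample frames from the question. The names keys, renamed and filled are only illustrative. Note that update works in place and, by default, also overwrites existing non-NaN values in df1 wherever df2 has a value; passing overwrite=False restricts it to filling NaNs.

```python
import numpy as np
import pandas as pd

years = [2015, 2016, 2017, 2018, 2019]
df1 = pd.DataFrame({
    "year": years * 2,
    "city": ["sh"] * 5 + ["bj"] * 5,
    "gdp_value": [np.nan] * 5 + [7.0, 3.0, 2.0, 5.0, 4.0],
    "growth_rate": [np.nan] * 5 + [0.01, 0.03, -0.03, 0.05, 0.02],
})
df2 = pd.DataFrame({
    "year": years,
    "city": ["sh"] * 5,
    "gdp": [6, 5, 3, 6, 4],
    "rate": [0.04, 0.07, -0.03, 0.05, 0.02],
})

keys = ["year", "city"]  # illustrative name for the join columns

# Rename df2's columns so they line up with df1's column names.
renamed = df2.set_index(keys).rename(
    columns={"gdp": "gdp_value", "rate": "growth_rate"}
)

filled = df1.set_index(keys)
# overwrite=False: only NaN cells in df1 are filled from df2.
filled.update(renamed, overwrite=False)
filled = filled.reset_index()

print(filled)
```

Because update keeps df1's index, the rows stay in their original order (sh first, then bj).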

Another way is to use combine_first with set_index and column renaming:

(
    df1.set_index(['year', 'city'])
    .combine_first(
        df2.set_index(['year', 'city'])
        .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'})
    )
    .reset_index()
)

Output:

   year city  gdp_value  growth_rate
0  2015   bj        7.0         0.01
1  2015   sh        6.0         0.04
2  2016   bj        3.0         0.03
3  2016   sh        5.0         0.07
4  2017   bj        2.0        -0.03
5  2017   sh        3.0        -0.03
6  2018   bj        5.0         0.05
7  2018   sh        6.0         0.05
8  2019   bj        4.0         0.02
9  2019   sh        4.0         0.02
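The output above is sorted by (year, city) because combine_first returns the sorted union of both indexes, so the row order differs from the desired output in the question. If the original order of df1 matters, one option is to reindex back to df1's keys afterwards. This is a minimal sketch that assumes the (year, city) pairs are unique; keys, original_order and combined are illustrative names.

```python
import numpy as np
import pandas as pd

years = [2015, 2016, 2017, 2018, 2019]
df1 = pd.DataFrame({
    "year": years * 2,
    "city": ["sh"] * 5 + ["bj"] * 5,
    "gdp_value": [np.nan] * 5 + [7.0, 3.0, 2.0, 5.0, 4.0],
    "growth_rate": [np.nan] * 5 + [0.01, 0.03, -0.03, 0.05, 0.02],
})
df2 = pd.DataFrame({
    "year": years,
    "city": ["sh"] * 5,
    "gdp": [6, 5, 3, 6, 4],
    "rate": [0.04, 0.07, -0.03, 0.05, 0.02],
})

keys = ["year", "city"]
# Remember df1's key order so it can be restored after combine_first.
original_order = pd.MultiIndex.from_frame(df1[keys])

combined = (
    df1.set_index(keys)
    .combine_first(
        df2.set_index(keys)
        .rename(columns={"gdp": "gdp_value", "rate": "growth_rate"})
    )
    .reindex(original_order)  # undo the index sorting
    .reset_index()
)

print(combined)
```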

edited Oct 31, 2019 at 13:03

answered Oct 31, 2019 at 3:27

6 Comments

Scott Boston Over a year ago

There are lots of ways to solve this problem; however, the key steps are to set the index with set_index and to rename the columns so they match in each dataframe. pandas does almost all of its operations using index alignment.
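As one illustration of those other ways, the left merge attempted in the question can be finished with fillna, which also relies on index alignment: each Series passed to fillna is matched to the target column by row label. This is only a sketch of an alternative, not the approach from the answer, and the names merged and result are illustrative.

```python
import numpy as np
import pandas as pd

years = [2015, 2016, 2017, 2018, 2019]
df1 = pd.DataFrame({
    "year": years * 2,
    "city": ["sh"] * 5 + ["bj"] * 5,
    "gdp_value": [np.nan] * 5 + [7.0, 3.0, 2.0, 5.0, 4.0],
    "growth_rate": [np.nan] * 5 + [0.01, 0.03, -0.03, 0.05, 0.02],
})
df2 = pd.DataFrame({
    "year": years,
    "city": ["sh"] * 5,
    "gdp": [6, 5, 3, 6, 4],
    "rate": [0.04, 0.07, -0.03, 0.05, 0.02],
})

# Same left merge as in the question: df2's columns land next to df1's.
merged = df1.merge(df2, on=["year", "city"], how="left")

# Fill each df1 column from its df2 counterpart, row by row.
merged["gdp_value"] = merged["gdp_value"].fillna(merged["gdp"])
merged["growth_rate"] = merged["growth_rate"].fillna(merged["rate"])

# Drop the helper columns that came from df2.
result = merged.drop(columns=["gdp", "rate"])

print(result)
```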

ah bon Over a year ago

Sorry, I get TypeError: Cannot compare type Period with type str with my real data.

Scott Boston Over a year ago

The dtypes of the key columns also need to match. In one dataframe the dtype is a time period and in the other it is a string.
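The real data is not shown in the thread, so the following is only a sketch under assumed dtypes: year stored as a yearly Period in one frame and as a string in the other. The frames left and right are hypothetical stand-ins. Converting both key columns to the same integer dtype before set_index lets the indexes align.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'year' is a Period on the left, a string on the right.
left = pd.DataFrame({
    "year": pd.period_range("2015", periods=3, freq="Y"),
    "city": ["sh"] * 3,
    "gdp_value": [np.nan] * 3,
})
right = pd.DataFrame({
    "year": ["2015", "2016", "2017"],
    "city": ["sh"] * 3,
    "gdp_value": [6.0, 5.0, 3.0],
})

# Normalize both key columns to plain integers.
left["year"] = left["year"].dt.year.astype("int64")
right["year"] = right["year"].astype("int64")

keys = ["year", "city"]
filled = (
    left.set_index(keys)
    .combine_first(right.set_index(keys))
    .reset_index()
)

print(filled.dtypes)
print(filled)
```

Checking df1.dtypes and df2.dtypes before combining is a quick way to spot this kind of mismatch.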

ah bon Over a year ago

Another issue: the values in some other columns of df1 become NaN after combine_first.

Scott Boston Over a year ago

Hrm.... That shouldn't happen. Can you start a new question with a dataset that shows this behavior?
