I want to merge two dataframes on year and city, filling the missing values of gdp_value and growth_rate in df1 with the values of gdp and rate, respectively, from df2.

df1

   year city  gdp_value  growth_rate
0  2015   sh        NaN          NaN
1  2016   sh        NaN          NaN
2  2017   sh        NaN          NaN
3  2018   sh        NaN          NaN
4  2019   sh        NaN          NaN
5  2015   bj        7.0         0.01
6  2016   bj        3.0         0.03
7  2017   bj        2.0        -0.03
8  2018   bj        5.0         0.05
9  2019   bj        4.0         0.02

df2

   year city  gdp  rate
0  2015   sh    6  0.04
1  2016   sh    5  0.07
2  2017   sh    3 -0.03
3  2018   sh    6  0.05
4  2019   sh    4  0.02

I tried pd.merge(df1, df2, on=['year', 'city'], how='left') and got:

   year city  gdp_value  growth_rate  gdp  rate
0  2015   sh        NaN          NaN  6.0  0.04
1  2016   sh        NaN          NaN  5.0  0.07
2  2017   sh        NaN          NaN  3.0 -0.03
3  2018   sh        NaN          NaN  6.0  0.05
4  2019   sh        NaN          NaN  4.0  0.02
5  2015   bj        7.0         0.01  NaN   NaN
6  2016   bj        3.0         0.03  NaN   NaN
7  2017   bj        2.0        -0.03  NaN   NaN
8  2018   bj        5.0         0.05  NaN   NaN
9  2019   bj        4.0         0.02  NaN   NaN

My desired output df is like this:

   year city  gdp_value  growth_rate
0  2015   sh          6         0.04
1  2016   sh          5         0.07
2  2017   sh          3        -0.03
3  2018   sh          6         0.05
4  2019   sh          4         0.02
5  2015   bj          7         0.01
6  2016   bj          3         0.03
7  2017   bj          2        -0.03
8  2018   bj          5         0.05
9  2019   bj          4         0.02

Thanks for your help.

Edit: this solution seems to work, thanks:

df1 = df1.set_index(['year', 'city'])
df1.update(
    df2
    .set_index(['year', 'city'])
    .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'})
)
df1 = df1.reset_index()

1 Answer

As mentioned in the question, you can also use update, depending on your data and needs:

df1 = df1.set_index(['year', 'city'])
df1.update(
    df2
    .set_index(['year', 'city'])
    .rename(columns={'gdp': 'gdp_value', 'rate': 'growth_rate'})
)
df1 = df1.reset_index()
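
For anyone who wants to try the update approach end to end, here is a minimal sketch that rebuilds the sample frames from the question. The names keys, renamed and filled are only illustrative, and the data is typed in by hand from the tables above.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt by hand from the tables in the question.
df1 = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019] * 2,
    "city": ["sh"] * 5 + ["bj"] * 5,
    "gdp_value": [np.nan] * 5 + [7.0, 3.0, 2.0, 5.0, 4.0],
    "growth_rate": [np.nan] * 5 + [0.01, 0.03, -0.03, 0.05, 0.02],
})
df2 = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019],
    "city": ["sh"] * 5,
    "gdp": [6, 5, 3, 6, 4],
    "rate": [0.04, 0.07, -0.03, 0.05, 0.02],
})

keys = ["year", "city"]  # illustrative name for the join columns

# Give df2 the same index and column names as df1 so the values align.
renamed = df2.set_index(keys).rename(
    columns={"gdp": "gdp_value", "rate": "growth_rate"}
)

filled = df1.set_index(keys)
filled.update(renamed)  # works in place and returns None
filled = filled.reset_index()

print(filled)
```

Note that update keeps the row order of df1. By default it also overwrites values that are already present in df1 whenever df2 has a non-NaN value for the same key; passing overwrite=False restricts it to filling only the missing values.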

Another way is to use combine_first after set_index and renaming the columns:

df1.set_index(['year','city'])\
   .combine_first(df2.set_index(['year','city'])
                     .rename(columns={'gdp':'gdp_value','rate':'growth_rate'}))\
   .reset_index()

Output:

   year city  gdp_value  growth_rate
0  2015   bj        7.0         0.01
1  2015   sh        6.0         0.04
2  2016   bj        3.0         0.03
3  2016   sh        5.0         0.07
4  2017   bj        2.0        -0.03
5  2017   sh        3.0        -0.03
6  2018   bj        5.0         0.05
7  2018   sh        6.0         0.05
8  2019   bj        4.0         0.02
9  2019   sh        4.0         0.02
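
The output above comes back sorted by year and city, because combine_first returns the sorted union of both indexes. If the original row order of df1 matters, as in the desired output in the question, one option is to reindex with the original index afterwards. This is a sketch under that assumption, with the sample data rebuilt by hand; indexed, combined and restored are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample frames rebuilt by hand from the tables in the question.
df1 = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019] * 2,
    "city": ["sh"] * 5 + ["bj"] * 5,
    "gdp_value": [np.nan] * 5 + [7.0, 3.0, 2.0, 5.0, 4.0],
    "growth_rate": [np.nan] * 5 + [0.01, 0.03, -0.03, 0.05, 0.02],
})
df2 = pd.DataFrame({
    "year": [2015, 2016, 2017, 2018, 2019],
    "city": ["sh"] * 5,
    "gdp": [6, 5, 3, 6, 4],
    "rate": [0.04, 0.07, -0.03, 0.05, 0.02],
})

keys = ["year", "city"]
indexed = df1.set_index(keys)
renamed = df2.set_index(keys).rename(
    columns={"gdp": "gdp_value", "rate": "growth_rate"}
)

# combine_first keeps df1's values and fills its NaNs from df2,
# but the result is ordered by the sorted union of both indexes.
combined = indexed.combine_first(renamed)

# Reindexing with df1's original index restores the original row order.
restored = combined.reindex(indexed.index).reset_index()

print(restored)
```

Reindexing this way also drops any (year, city) pairs that exist only in df2. That makes no difference for the sample data, where every key in df2 is already present in df1.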

6 Comments

There are many ways to solve this problem; the key steps are to use set_index and to rename the columns so that they match in both dataframes. pandas performs almost all of its operations using index alignment.
Sorry, I get TypeError: Cannot compare type Period with type str with my real data.
You also need the dtypes of each column to match. One of your dataframes has a time period dtype and the other has strings.
Another issue: some values in other columns of df1 become NaN after combine_first.
Hmm... That shouldn't happen. Can you start a new question with a dataset that shows this behavior?
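
On the Period versus str error raised in the comments: the real data was never posted, so the following is only a guess at the situation. It assumes the year column is a period dtype in one frame and strings in the other, and converts both to plain integers before aligning. The frames left and right are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames: the same keys, stored with different dtypes.
left = pd.DataFrame({
    "year": pd.period_range("2015", periods=3, freq="Y"),  # period dtype
    "city": ["sh"] * 3,
    "gdp_value": [np.nan] * 3,
})
right = pd.DataFrame({
    "year": ["2015", "2016", "2017"],  # strings
    "city": ["sh"] * 3,
    "gdp_value": [6.0, 5.0, 3.0],
})

# Bring both key columns to the same dtype before setting the index.
left["year"] = left["year"].dt.year.astype("int64")
right["year"] = right["year"].astype("int64")

keys = ["year", "city"]
result = (
    left.set_index(keys)
    .combine_first(right.set_index(keys))
    .reset_index()
)

print(result)
```

Any common dtype would do here; integers are used only because the years in the question are integers.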