
I have two pandas DataFrames in Python. I want to update rows in the first DataFrame using matching values from a second DataFrame, which serves as an override.

Here is an example with sample data and code:

DataFrame 1 :

   Code      Name  Value
0     1  Company1    200
1     2  Company2    300
2     3  Company3    400

DataFrame 2:

   Code      Name  Value
0     2  Company2   1000

I want to update DataFrame 1 based on matching Code and Name. In this example, DataFrame 1 should be updated as below:

   Code      Name  Value
0     1  Company1    200
1     2  Company2   1000
2     3  Company3    400

Note: the row with Code = 2 and Name = Company2 is updated with Value 1000 (coming from DataFrame 2).

import pandas as pd

data1 = {
    'Code': [1, 2, 3],
    'Name': ['Company1', 'Company2', 'Company3'],
    'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns=['Code', 'Name', 'Value'])

data2 = {
    'Code': [2],
    'Name': ['Company2'],
    'Value': [1000],
}
df2 = pd.DataFrame(data2, columns=['Code', 'Name', 'Value'])

Any pointers or hints?



Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):

>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index()  # to recover the initial structure

   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
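If the frames have more columns than should be overridden, you can restrict update to a subset before calling it. A minimal sketch, assuming a hypothetical extra Region column that must survive untouched:

```python
import pandas as pd

# Hypothetical wider frames: only 'Value' should be overridden, not 'Region'.
df1 = pd.DataFrame({
    'Code': [1, 2, 3],
    'Name': ['Company1', 'Company2', 'Company3'],
    'Value': [200, 300, 400],
    'Region': ['EU', 'US', 'EU'],
})
df2 = pd.DataFrame({
    'Code': [2],
    'Name': ['Company2'],
    'Value': [1000],
    'Region': ['APAC'],  # present in df2, but should NOT win
})

df1.set_index('Code', inplace=True)
# Selecting only the override columns before update() leaves 'Region' alone.
df1.update(df2.set_index('Code')[['Value']])
df1.reset_index(inplace=True)
print(df1)
```

Note that update still upcasts the touched column to float, because the reindexed override frame contains NaN for the rows it does not cover.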

3 Comments

This seems to be the most ideal solution among all, but can you help with one thing? What if df1 and df2 each had 5 columns, but I wanted to update only the "Value" column and not the rest (the code above updates all columns for that index)? Is that possible?
Why is the Value column converted to float?
This was the solution I was looking for. You can also expand this to multiple lookup columns, df1.set_index(['Code', 'Name'], inplace=True), and update multiple measure columns in case you have e.g. Value, Sales, etc.

You can use concat + drop_duplicates, which updates the common rows and adds the new rows from df2:

pd.concat([df1, df2]).drop_duplicates(['Code', 'Name'], keep='last').sort_values('Code')
Out[1280]: 
   Code      Name  Value
0     1  Company1    200
0     2  Company2   1000
2     3  Company3    400

Update, due to the comments below:

df1.set_index(['Code', 'Name'], inplace=True)

df1.update(df2.set_index(['Code', 'Name']))

df1.reset_index(drop=True, inplace=True)

2 Comments

Just want to point out that this solution not only updates the entries of dataframe1 but also adds new entries from dataframe2 which were not present in dataframe1 before.
It also blows up the memory as it has to make a duplicate of both dataframes before dropping the duplicates.
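A small sketch of the behaviour the first comment describes, with a hypothetical extra row (Code 4) that exists only in df2 and survives into the result:

```python
import pandas as pd

df1 = pd.DataFrame({'Code': [1, 2, 3],
                    'Name': ['Company1', 'Company2', 'Company3'],
                    'Value': [200, 300, 400]})
# Hypothetical override frame containing a row df1 does not have (Code 4).
df2 = pd.DataFrame({'Code': [2, 4],
                    'Name': ['Company2', 'Company4'],
                    'Value': [1000, 999]})

out = (pd.concat([df1, df2])
         .drop_duplicates(['Code', 'Name'], keep='last')
         .sort_values('Code')
         .reset_index(drop=True))
print(out)  # Code 2 is overridden AND Code 4 is appended: 4 rows
```

If new rows are unwanted, this approach needs an extra filter; the update-based answers leave df1's row set unchanged.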

You can merge the data first and then use numpy.where:

import numpy as np

updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(updated['Value_new'].notnull(), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)

   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0

1 Comment

Thanks. So Left join and then update 'Value' field with 'Value_new' for non NaN rows.

There is an update function available.

example:

df1.update(df2)

for more info:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html

2 Comments

An older, identical, and better answer already exists.
It would be required to set_index first to use update reliably.
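A sketch of what the second comment warns about: without set_index, update aligns on the default RangeIndex, so the override lands on the wrong row (using the question's own data):

```python
import pandas as pd

df1 = pd.DataFrame({'Code': [1, 2, 3],
                    'Name': ['Company1', 'Company2', 'Company3'],
                    'Value': [200, 300, 400]})
df2 = pd.DataFrame({'Code': [2],
                    'Name': ['Company2'],
                    'Value': [1000]})

df1.update(df2)  # aligns on the default RangeIndex, not on Code/Name
print(df1)       # row 0 (Company1!) is overwritten by df2's only row
```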

You can align indices and then use combine_first:

res = df2.set_index(['Code', 'Name'])\
         .combine_first(df1.set_index(['Code', 'Name']))\
         .reset_index()

print(res)

#    Code      Name   Value
# 0     1  Company1   200.0
# 1     2  Company2  1000.0
# 2     3  Company3   400.0

2 Comments

This is not a valid answer, because: "Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two." pandas.pydata.org/pandas-docs/stable/reference/api/… @safiqul islam mentioned below the update function, which seems to work. pandas.pydata.org/pandas-docs/stable/reference/api/…
@CorinaRosa Can you give a counter-example?
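A counter-example sketch (with a hypothetical extra row, Code 4, present only in df2): combine_first takes the union of the two indexes, so keys that exist only in the override frame are added to the result, unlike update:

```python
import pandas as pd

df1 = pd.DataFrame({'Code': [1, 2, 3],
                    'Name': ['Company1', 'Company2', 'Company3'],
                    'Value': [200, 300, 400]})
# Hypothetical override containing a key absent from df1.
df2 = pd.DataFrame({'Code': [2, 4],
                    'Name': ['Company2', 'Company4'],
                    'Value': [1000, 999]})

res = (df2.set_index(['Code', 'Name'])
          .combine_first(df1.set_index(['Code', 'Name']))
          .reset_index())
print(res)  # the (4, 'Company4') row appears: indexes are unioned, not intersected
```

For the question's exact data the keys coincide, so combine_first and update give the same answer; they diverge only when df2 carries extra keys.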

Assuming company and code are redundant identifiers, you can also do

import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()

df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)

#   Code      Name  Value
#0     1  Company1    200
#1     2  Company2   1000
#2     3  Company3    400
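Equivalently, a sketch that skips the isin mask: Series.map yields NaN for names without an override, and fillna keeps the old value in those spots (same assumption that Name alone identifies a row):

```python
import pandas as pd

df1 = pd.DataFrame({'Code': [1, 2, 3],
                    'Name': ['Company1', 'Company2', 'Company3'],
                    'Value': [200, 300, 400]})
df2 = pd.DataFrame({'Code': [2], 'Name': ['Company2'], 'Value': [1000]})

vdic = pd.Series(df2.Value.values, index=df2.Name)
# map() yields NaN for names without an override; fillna keeps the old value there.
df1['Value'] = df1['Name'].map(vdic).fillna(df1['Value'])
print(df1)
```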



There's something I often do.

I merge 'left' first:

df_merged = pd.merge(df1, df2, how = 'left', on = 'Code')

Pandas will create columns with extension '_x' (for your left dataframe) and '_y' (for your right dataframe)

You want the ones that came from the right. So drop the '_x' columns and rename the '_y' ones. (Note: str.strip('_y') strips the *characters* '_' and 'y' from both ends, not the suffix, and would mangle a name like 'Money_y' into 'Mone', so slice the suffix off instead.)

for col in list(df_merged.columns):
    if col.endswith('_x'):
        df_merged.drop(columns=col, inplace=True)
    elif col.endswith('_y'):
        df_merged.rename(columns={col: col[:-len('_y')]}, inplace=True)



You can use pd.Series.where on the result of left-joining df1 and df2

merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0

You can change the line to

df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)

in order to return the value to be an integer.
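The same idea can be sketched with Series.fillna instead of where, which reads a little more directly (default merge suffixes '_x'/'_y' assumed):

```python
import pandas as pd

df1 = pd.DataFrame({'Code': [1, 2, 3],
                    'Name': ['Company1', 'Company2', 'Company3'],
                    'Value': [200, 300, 400]})
df2 = pd.DataFrame({'Code': [2], 'Name': ['Company2'], 'Value': [1000]})

merged = df1.merge(df2, on=['Code', 'Name'], how='left')
# Value_y is NaN where df2 has no override; fillna falls back to Value_x.
df1['Value'] = merged['Value_y'].fillna(merged['Value_x']).astype(int)
print(df1)
```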

2 Comments

Why is it adding .0 to the value ? (Not a big deal, but just curious)
@ProgSky It is because the type changed. I updated the answer to show how to return it to int.
  1. Concatenate the datasets
  2. Drop the duplicates by Code
  3. Sort the values

combined_df = pd.concat([df1, df2]).drop_duplicates(['Code'], keep='last').sort_values('Code')

(DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)



None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:

indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
# .loc (not .at, which is for single scalar access) accepts a list of labels;
# this assumes df2's rows are ordered consistently with the matched indexes.
df1.loc[indexes, 'Value'] = df2['Value'].values

