Pandas: KeyError when trying to merge two dataframes

Question

I have two Excel sheets (Master & Input) with the same index column but a different number of columns (see below). I want to merge the Input DF into the Master DF if new rows have been added (see ID 103-105) OR an item in the Input DF has been updated (see ID 102). Other columns can be ignored.

Dataframe 1 (Master):

Dataframe 2 (Input):

Goal (updated cells marked in yellow):

I am using the following script:

inputDf = pd.read_excel(inputFileName).set_index("ID")
masterDf = pd.read_excel(masterFileName).set_index("ID")

# Update existing rows
masterDf.update(inputDf)

# find out which ids are new
ids_of_new_rows = set(inputDf.index) - set(masterDf.index)

# get new rows that should be added to master
rows_to_add = masterDf.loc[ids_of_new_rows, inputDf.columns & masterDf.columns]

I am able to update the Master DF and get ids_of_new_rows. Output: {'CR103', 'CR104', 'CR105'}

However, when trying to get rows_to_add, I always receive the following error:

KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['CR103', 'CR104', 'CR105'], dtype='object', name='ID')] are in the [index]"

Any ideas?

It should be rows_to_add = inputDf.loc[...], but you are pointing to masterDf there. That is where the mistake is.
– Sander van den Oord, Commented Nov 9, 2020 at 17:18

Niko Fohr · Accepted Answer · 2020-11-10 08:00:36Z

1

About the error

The error occurs because the rows with IDs ['CR103', 'CR104', 'CR105'] do not exist in masterDf; they exist only in inputDf. What you probably meant to write is

rows_to_add = inputDf.loc[list(ids_of_new_rows), inputDf.columns.intersection(masterDf.columns)]
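To make the cause of the error concrete, here is a minimal sketch that reproduces it without the Excel files. The two small frames and the "Status" column are made-up stand-ins for the real sheets, so treat this as an illustration under those assumptions rather than the asker's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for the Master and Input sheets (column name is made up).
master = pd.DataFrame(
    {"ID": ["CR101", "CR102"], "Status": ["open", "open"]}
).set_index("ID")
incoming = pd.DataFrame(
    {"ID": ["CR102", "CR103", "CR104", "CR105"],
     "Status": ["closed", "open", "open", "open"]}
).set_index("ID")

# IDs that appear in the input but not in the master.
new_ids = sorted(set(incoming.index) - set(master.index))

try:
    master.loc[new_ids]  # none of these labels exist in master's index
    lookup_failed = False
except KeyError:
    lookup_failed = True

# The same labels do exist in the input frame, so this lookup succeeds.
rows_to_add = incoming.loc[new_ids]
```

The new IDs are, by construction, exactly the labels that master lacks, so any label-based lookup for them has to go through the input frame.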

What you probably want to do

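import pandas as pd

inputDf = pd.read_excel(inputFileName).set_index("ID")
masterDf = pd.read_excel(masterFileName).set_index("ID")

# Update existing rows in place with the values from inputDf
masterDf.update(inputDf)

# Add new rows, keeping only the columns that both frames share
new_ids = inputDf.index.difference(masterDf.index)
shared_columns = inputDf.columns.intersection(masterDf.columns)
masterDf = pd.concat((masterDf, inputDf.loc[new_ids, shared_columns]))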

Here, Index.difference is used to get the index values in inputDf that are not present in masterDf.
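As a small illustration of Index.difference on its own, the sketch below uses two made-up indexes in place of the real ones. Unlike a Python set subtraction, the result is still a pandas Index, so it can be passed to .loc directly.

```python
import pandas as pd

# Hypothetical index values standing in for inputDf.index and masterDf.index.
input_index = pd.Index(["CR105", "CR102", "CR103", "CR104"], name="ID")
master_index = pd.Index(["CR101", "CR102"], name="ID")

# Labels that are in input_index but not in master_index.
new_ids = input_index.difference(master_index)
```

With its default settings, difference also tries to sort the result, which gives the appended rows a predictable order.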

edited Nov 10, 2020 at 8:00

answered Nov 9, 2020 at 17:19

Niko Fohr


2 Comments

maxiw46 Over a year ago

Thank you. Using your solution, additional columns from Input DF are getting added to the Master DF as well. Those should be ignored. Hence, the correct solution is to use the corrected rows_to_add in the concat statement, like this: df_result = pd.concat([masterDf, rows_to_add])

Niko Fohr Over a year ago

Oh yeah missed that you'd like to drop the extra columns from inputDf. Updated the answer.
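The column issue raised in these comments can be sketched with toy data. The frames, the "Status" column, and the extra "Comment" column are assumptions made up for illustration; the point is only to contrast concatenating the new rows with and without restricting them to the shared columns.

```python
import pandas as pd

# Hypothetical Master sheet with a single data column.
master = pd.DataFrame(
    {"ID": ["CR101", "CR102"], "Status": ["open", "open"]}
).set_index("ID")

# Hypothetical Input sheet with an extra "Comment" column that master lacks.
incoming = pd.DataFrame(
    {"ID": ["CR102", "CR103"],
     "Status": ["closed", "open"],
     "Comment": ["done", "new"]}
).set_index("ID")

new_ids = incoming.index.difference(master.index)
shared_cols = incoming.columns.intersection(master.columns)

# update() only touches labels and columns that master already has.
master.update(incoming)

# Without a column restriction, concat brings the extra column along.
with_extra = pd.concat([master, incoming.loc[new_ids]])

# Restricting the new rows to the shared columns keeps master's layout.
without_extra = pd.concat([master, incoming.loc[new_ids, shared_cols]])
```

In this sketch, with_extra gains a "Comment" column while without_extra keeps only "Status", which matches the behaviour described in the comments.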

maxiw46 · Accepted Answer · 2020-11-10 07:26:55Z

0

Here is the corrected script that achieves the outcome described in the question. The fix was simply to select the new rows from inputDf instead of masterDf.

import pandas as pd

# Load both sheets and index them by ID
inputDf = pd.read_excel(inputFileName).set_index("ID")
masterDf = pd.read_excel(masterFileName).set_index("ID")

# Update existing rows
masterDf.update(inputDf)

# Find out which IDs are new (as a sorted list, since .loc expects a list-like rather than a set)
ids_of_new_rows = sorted(set(inputDf.index) - set(masterDf.index))

# Get the new rows that should be added to master, restricted to the shared columns
rows_to_add = inputDf.loc[ids_of_new_rows, inputDf.columns.intersection(masterDf.columns)]

# Add the new rows to the existing master
df_result = pd.concat([masterDf, rows_to_add])

answered Nov 10, 2020 at 7:26

maxiw46

