
I have two Excel sheets (Master and Input) that share the same index column but have a different number of columns (see below). I want to merge the Input DF into the Master DF when new rows have been added (see IDs 103-105) or an existing item has been updated in the Input DF (see ID 102). Columns that exist only in the Input DF can be ignored.

Dataframe 1 (Master):

[image: Master DF]

Dataframe 2 (Input):

[image: Input DF]

Goal (updated cells marked in yellow):

[image: expected result]

I am using the following script:

inputDf = pd.read_excel(inputFileName).set_index("ID")
masterDf = pd.read_excel(masterFileName).set_index("ID")

# Update existing rows
masterDf.update(inputDf)

# find out which ids are new
ids_of_new_rows = set(inputDf.index) - set(masterDf.index)

# get new rows that should be added to master
rows_to_add = masterDf.loc[ids_of_new_rows, inputDf.columns & masterDf.columns]

I am able to update the Master DF and get ids_of_new_rows. Output: {'CR103', 'CR104', 'CR105'}

However, when trying to get rows_to_add, I always receive the following error:

KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['CR103', 'CR104', 'CR105'], dtype='object', name='ID')] are in the [index]"

Any ideas?

  • it should be rows_to_add = inputDf.loc etc etc, but you are pointing to masterDf there. This is where the mistake is. Commented Nov 9, 2020 at 17:18
  • Thank you, @SandervandenOord. What a stupid mistake. Commented Nov 10, 2020 at 7:24

2 Answers

About the error

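The error occurs because there are no rows with the IDs ['CR103', 'CR104', 'CR105'] in masterDf; they exist only in inputDf. What you probably meant is

rows_to_add = inputDf.loc[list(ids_of_new_rows), inputDf.columns.intersection(masterDf.columns)]

(list(...) and Index.intersection are used here because recent pandas versions no longer accept a set as a .loc indexer and no longer treat & on an Index as a set intersection.)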
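To see why the lookup fails, here is a small sketch with made-up data (the two frames and their values are invented for illustration, not taken from the question's sheets). After update(), the master still has only its original index, so selecting the new IDs from it raises a KeyError, while selecting them from the input works.

```python
import pandas as pd

# Hypothetical stand-ins for the two Excel sheets
master = pd.DataFrame(
    {"ID": ["CR101", "CR102"], "Status": ["open", "open"]}
).set_index("ID")
inp = pd.DataFrame(
    {"ID": ["CR102", "CR103"], "Status": ["closed", "open"]}
).set_index("ID")

# update() changes CR102 in place but never adds rows to master
master.update(inp)

# IDs that exist only in the input
new_ids = sorted(set(inp.index) - set(master.index))

# Looking the new IDs up in master fails: they are not in its index
try:
    master.loc[new_ids]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Looking them up in the input succeeds
rows_to_add = inp.loc[new_ids]
```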

What you probably want to do

import pandas as pd

inputDf = pd.read_excel(inputFileName).set_index("ID")
masterDf = pd.read_excel(masterFileName).set_index("ID")

# Update existing rows
masterDf.update(inputDf)

# Add new rows, keeping only the columns that masterDf already has
new_ids = inputDf.index.difference(masterDf.index)
shared_cols = inputDf.columns.intersection(masterDf.columns)
masterDf = pd.concat((masterDf, inputDf.loc[new_ids, shared_cols]))

Here, Index.difference is used to get the index values of inputDf that are not present in masterDf.
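A minimal sketch of the Index set operations involved, on invented data (the `Owner` and `Extra` columns and all values are assumptions for illustration). `Index.difference` returns the labels found in the input but not in the master, and `Index.intersection` returns the columns both frames share. The method form is used rather than `&` because recent pandas releases treat `&` on an Index as a logical operation instead of a set intersection.

```python
import pandas as pd

# Invented example frames; "Extra" exists only in the input
master = pd.DataFrame(
    {"Status": ["open", "open"], "Owner": ["ann", "bob"]},
    index=pd.Index(["CR101", "CR102"], name="ID"),
)
inp = pd.DataFrame(
    {"Status": ["closed", "open"], "Extra": ["x", "y"]},
    index=pd.Index(["CR102", "CR103"], name="ID"),
)

# Row labels present in the input but missing from the master
new_ids = inp.index.difference(master.index)

# Column labels present in both frames
shared_cols = inp.columns.intersection(master.columns)

# Update existing rows, then append the new ones without the extra column
master.update(inp[shared_cols])
result = pd.concat([master, inp.loc[new_ids, shared_cols]])
```

The new row gets a missing value in any master column that the input does not provide (here `Owner`).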

2 Comments

Thank you. Using your solution, additional columns from Input DF are getting added to the Master DF as well. Those should be ignored. Hence, the correct solution is to use the corrected rows_to_add in the concat statement, like this: df_result = pd.concat([masterDf, rows_to_add])
Oh yeah missed that you'd like to drop the extra columns from inputDf. Updated the answer.

Here is the corrected script that achieves the outcome described in the question. The fix was simply to select the new rows from inputDf instead of masterDf.

import pandas as pd

# Load both sheets, indexed by ID
inputDf = pd.read_excel(inputFileName).set_index("ID")
masterDf = pd.read_excel(masterFileName).set_index("ID")

# Update existing rows
masterDf.update(inputDf)

# Find out which IDs are new (as a sorted list, since recent pandas rejects a set in .loc)
ids_of_new_rows = sorted(set(inputDf.index) - set(masterDf.index))

# Get the new rows, restricted to the columns both frames share
rows_to_add = inputDf.loc[ids_of_new_rows, inputDf.columns.intersection(masterDf.columns)]

# Add new rows to existing master
df_result = pd.concat([masterDf, rows_to_add])
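
One behavior of this approach is worth knowing: DataFrame.update only copies non-NA values, so an empty cell in the Input sheet leaves the corresponding Master value unchanged. A small sketch with invented data (the helper name `merge_into_master`, the `Owner` column, and all values are hypothetical) that also works on a copy so the original master is not modified:

```python
import pandas as pd

def merge_into_master(master, inp):
    """Return a copy of master updated from inp, plus rows whose ID is new."""
    shared_cols = inp.columns.intersection(master.columns)
    result = master.copy()            # leave the caller's frame untouched
    result.update(inp[shared_cols])   # NaN/None cells in inp are skipped
    new_ids = inp.index.difference(result.index)
    return pd.concat([result, inp.loc[new_ids, shared_cols]])

# Invented example data: the input has an empty Owner cell for CR102
master = pd.DataFrame(
    {"Status": ["open", "open"], "Owner": ["ann", "bob"]},
    index=pd.Index(["CR101", "CR102"], name="ID"),
)
inp = pd.DataFrame(
    {"Status": ["closed", "open"], "Owner": [None, "cy"]},
    index=pd.Index(["CR102", "CR103"], name="ID"),
)

merged = merge_into_master(master, inp)
```

If blank input cells should clear the master value instead, update() alone is not enough and the affected rows would have to be replaced explicitly.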
