
I have been searching a long time for an answer but could not find it. I have two dataframes, target and backup, which both have the same columns. What I want to do is to look at a given column and add to target all the rows from backup whose value in that column is not already in target. The most straightforward solution for this is:

import pandas as pd
import numpy as np

target = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K5"],
    "A": ["A1", "A2", "A3", np.nan],
    "B": ["B1", "B2", "B3", "B5"],
})

backup = pd.DataFrame({
    "key1": ["K1", "K2", "K3", "K4", "K5"],
    "A": ["A1", "A", "A3", "A4", "A5"],
    "B": ["B1", "B2", "B3", "B4", "B5"],
})

merged = target.copy()

# append every backup row whose key1 does not appear in target
for item in backup.key1.unique():
    if item not in target.key1.unique():
        merged = pd.concat([merged, backup.loc[backup.key1 == item]])

merged.reset_index(drop=True, inplace=True)

giving

  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
4   K4   A4  B4

Now I have tried several things using just pandas, none of which works.

1. pandas concat
# Does not work because it creates duplicate rows, and when those are dropped, the updated rows that differ are still kept -- compare the lines with A and NaN

pd.concat([target, backup]).drop_duplicates()

  key1  A   B
0   K1  A1  B1
1   K2  A2  B2
2   K3  A3  B3
3   K5  NaN B5
1   K2  A   B2
3   K4  A4  B4
4   K5  A5  B5
2. pandas merge
# Does not work because the backup would overwrite data in the target -- NaN

pd.merge(target, backup, how="right")

  key1  A   B
0   K1  A1  B1
1   K2  A   B2
2   K3  A3  B3
3   K4  A4  B4
4   K5  A5  B5
3. Importantly, it is not a duplicate of this post, since I do not want to add a new column and, more importantly, the values are not NaN in target, they are simply not there. Furthermore, if I then used what is proposed there for merging the columns, the NaN in target would be replaced by the value from backup, which is unwanted (see the sketch below).
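
To illustrate what I mean, here is a minimal sketch, assuming the approach proposed there boils down to a merge with suffixes followed by fillna:

col_merged = target.merge(backup, on="key1", how="outer", suffixes=("", "_backup"))
# filling the gaps in A from the backup column would also fill the NaN of K5 with "A5",
# which is exactly what I do not want
col_merged["A"] = col_merged["A"].fillna(col_merged["A_backup"])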

4. It is not a duplicate of this post, which uses the combine_first method of pandas, because in that case the NaN is filled by the value from backup, which is wrong:

target.combine_first(backup)

  key1   A   B
0   K1  A1  B1
1   K2  A2  B2
2   K3  A3  B3
3   K5  A4  B5
4   K5  A5  B5
5. Lastly,
target.join(backup, on=["key1"])

gives me an annoying

ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

which I really do not get, since both key columns are pure strings, and the proposed solution does not work either.

So I would like to ask, what am I missing? How can I do it using some pandas methods? Thanks a lot.

2 Answers


Use concat with backup filtered to the rows whose key1 does not exist in target.key1, using Series.isin with boolean indexing:

merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]])
print(merged)
  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
3   K4   A4  B4
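
If you also want the clean 0 to 4 index from the expected output in the question, you can chain reset_index on top of the same snippet (a minor addition, not strictly necessary):

# same filtering as above, with the index rebuilt afterwards
merged = pd.concat([target, backup[~backup.key1.isin(target.key1)]]).reset_index(drop=True)
print(merged)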

Maybe you can try this with the subset parameter of df.drop_duplicates():

pd.concat([target, backup]).drop_duplicates(subset = "key1")

which gives output:

  key1    A   B
0   K1   A1  B1
1   K2   A2  B2
2   K3   A3  B3
3   K5  NaN  B5
3   K4   A4  B4
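
Note that drop_duplicates keeps the first occurrence by default (keep="first"), so this relies on target coming before backup inside the concat; written out explicitly, the same idea looks like this (a minimal sketch):

# keep="first" retains the target row whenever a key1 appears in both frames,
# so the order of the frames passed to concat matters
merged = pd.concat([target, backup]).drop_duplicates(subset="key1", keep="first")
print(merged)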

1 Comment

Hi, thanks for the answer. For some reason, this does not work on my dataset; it shows the complement (or something of that sort). I have not yet figured out why, or why my MRE does not cover that case, but I wanted to state it in case someone else figures it out faster.
