I would like to selectively overwrite values in one dataframe with values from another dataframe, matching rows on a column that is not the index of either dataframe. I can solve this problem by temporarily switching the index columns around, but I feel like there has to be a better, more efficient way. Searching here on SE and elsewhere was not fruitful.

example data

Note a few key points:

  • df2 has more rows than are required, and those extra rows should not be used.
  • The values of 'B' are not in the same order in the two dfs.
  • The existing indices don't match. The whole point of my question is that matching on existing indices should not be used.

Code:

import pandas as pd

df1 = pd.DataFrame({
    'A':['lorem','ipsum','dolor','sit'],
    'B':[1,2,3,4],
    'C':[30,40,5000,6000]})

df2 = pd.DataFrame({
    'B':[4,3,5,6],
    'C':[60,50,70,80]})


df1:
   A      B    C
0  lorem  1    30
1  ipsum  2    40
2  dolor  3    5000
3  sit    4    6000


df2:
   B    C
0  4    60
1  3    50
2  5    70
3  6    80

my desired output

   A      B    C
0  lorem  1    30
1  ipsum  2    40
2  dolor  3    50
3  sit    4    60

my non-ideal solution

# save indices and columns for both dfs, then re-index both
col_order1 = df1.columns
old_index1 = df1.index # not needed in my example, but needed in generalized case
df1.set_index('B', inplace=True)

col_order2 = df2.columns
old_index2 = df2.index 
df2.set_index('B', inplace=True)

# value substitution based on the new indices
df1.loc[df1.index.isin(df2.index), 'C'] = df2['C']

# undo the index changes to df1 and df2
df1.reset_index(inplace=True)
df1 = df1[col_order1]
df1.index = old_index1

df2.reset_index(inplace=True)
df2 = df2[col_order2]
df2.index = old_index2

Clearly this works, but I am new to Pandas and I feel like I am missing knowledge of some built-in method to do what I describe.

How can I achieve the desired result without having to shuffle those indices around?

  • Keywords: merge or map with pandas. Commented Nov 23, 2020 at 19:21
  • @QuangHoang yes, I have looked these up in the docs. I wouldn't be asking if it was as easy for me as "RTFM". If the solution is that trivial to you, why not answer the question? As it stands, your response is not terribly useful. Commented Nov 23, 2020 at 19:30

1 Answer

I would use merge() followed by combine_first():

newDF = df1.merge(df2,
         left_on="B",
         right_on="B",
         how='left', 
         suffixes=["", "_df2"])

newDF["C"] = newDF["C_df2"].combine_first(newDF["C"]).apply(int)
print(newDF[["A","B","C"]])

       A  B   C
0  lorem  1  30
1  ipsum  2  40
2  dolor  3  50
3    sit  4  60


Notes:

  • Specifying suffixes is helpful when both sides of the join have a column with the same name, because it keeps the result easy to read. I use an empty suffix for the left side, so the column from df1 keeps the name C and the one from df2 becomes C_df2.
  • I used .apply(int) because the left merge produces NaN in C_df2 wherever a join key from df1 is not present in df2. A NaN in a column of integers converts the column to floats, so the combined result has to be cast back to int.
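
For comparison, here is a minimal sketch of the map-based route mentioned in the comments on the question. It is not part of the answer above. It assumes the values of B in df2 are unique, since map needs an unambiguous lookup, and the names lookup, mapped and result are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['lorem', 'ipsum', 'dolor', 'sit'],
    'B': [1, 2, 3, 4],
    'C': [30, 40, 5000, 6000]})

df2 = pd.DataFrame({
    'B': [4, 3, 5, 6],
    'C': [60, 50, 70, 80]})

# Lookup Series: index holds df2's 'B', values hold df2's 'C'.
# set_index returns a new object here, so df2 itself is not modified.
lookup = df2.set_index('B')['C']

# Translate each 'B' in df1 through the lookup.
# Keys that are absent from df2 come back as NaN.
mapped = df1['B'].map(lookup)

# Keep df1's own 'C' where no replacement was found, then cast back
# to int because the intermediate NaN values made the column float.
result = df1.copy()
result['C'] = mapped.fillna(df1['C']).astype(int)

print(result)
```

Neither df1 nor df2 has its index changed, and the extra rows of df2 (B equal to 5 and 6) are never looked up, so they are ignored.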

1 Comment

Although I wish there was a cleaner way to do this, your method works exactly as intended. A simple function definition could turn this into a single, short line of code for easy implementation in my projects.
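
One way such a wrapper might look is sketched below. The function name overwrite_from and its parameters are hypothetical, not taken from the answer, and the sketch assumes the key values in the right-hand dataframe are unique so that the left merge returns exactly one row per row of the left-hand dataframe.

```python
import pandas as pd

def overwrite_from(left, right, key, col):
    """Hypothetical helper: return a copy of `left` in which `col` is
    replaced by the value from `right` wherever `key` matches.

    Assumes `right[key]` has no duplicates; otherwise the left merge
    would produce extra rows and the positional assignment would fail.
    """
    merged = left.merge(right[[key, col]], on=key, how='left',
                        suffixes=('', '_new'))
    # Prefer the incoming value and fall back to the original one.
    combined = merged[col + '_new'].combine_first(merged[col])
    out = left.copy()
    # A left merge keeps the row order of `left` but resets the index,
    # so assign by position to preserve `left`'s original index.
    out[col] = combined.astype(left[col].dtype).to_numpy()
    return out

df1 = pd.DataFrame({
    'A': ['lorem', 'ipsum', 'dolor', 'sit'],
    'B': [1, 2, 3, 4],
    'C': [30, 40, 5000, 6000]})

df2 = pd.DataFrame({
    'B': [4, 3, 5, 6],
    'C': [60, 50, 70, 80]})

new_df1 = overwrite_from(df1, df2, key='B', col='C')
print(new_df1)
```

The call itself is then the single short line; df1 and df2 are left untouched because the helper works on copies.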
