Compare two dataframes based on column data in Python pandas

Question

I have two dataframes, df1 and df2, and I would like to subtract df2 from df1, using a specific column, 'Code', to match the rows.

import pandas as pd
import numpy as np
rng = pd.date_range('2021-01-01', periods=10, freq='D')
df1 = pd.DataFrame(index=rng, data={'Val1': range(10), 'Val2': np.array(range(10))*5, 'Code': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})

df2 = pd.DataFrame(data={'Code': [1, 2, 3, 4], 'Val1': [10, 5, 15, 20], 'Val2': [4, 8, 10, 7]})

df1:

            Val1  Val2  Code
2021-01-01     0     0     1
2021-01-02     1     5     1
2021-01-03     2    10     1
2021-01-04     3    15     2
2021-01-05     4    20     2
2021-01-06     5    25     2
2021-01-07     6    30     3
2021-01-08     7    35     3
2021-01-09     8    40     3
2021-01-10     9    45     3

df2:

   Code  Val1  Val2
0     1    10     4
1     2     5     8
2     3    15    10
3     4    20     7

I am using the following code:

df = (df1.set_index(['Code']) - df2.set_index(['Code']))

and the result is

      Val1  Val2
Code
1    -10.0  -4.0
1     -9.0   1.0
1     -8.0   6.0
2     -2.0   7.0
2     -1.0  12.0
2      0.0  17.0
3     -9.0  20.0
3     -8.0  25.0
3     -7.0  30.0
3     -6.0  35.0
4      NaN   NaN

However, I only want the results for the rows that are in df1, not for keys that are missing from it (4 in this example).

How do I do that, and then set the index back to the original one from df1?

Something like this, but it doesn't work:

df = (df1.set_index(['Code']) - df2.set_index(['Code'])).set_index(df1['Code'])

Also, I would like to keep the column headers.

Desired output:

            Val1  Val2  Code
Date                        
2021-01-01 -10.0  -4.0     1
2021-01-02  -9.0   1.0     1
2021-01-03  -8.0   6.0     1
2021-01-04  -2.0   7.0     2
2021-01-05  -1.0  12.0     2
2021-01-06   0.0  17.0     2
2021-01-07  -9.0  20.0     3
2021-01-08  -8.0  25.0     3
2021-01-09  -7.0  30.0     3
2021-01-10  -6.0  35.0     3

Can you add your desired outcome please? It will make it simpler for us to get you what you need.
– sophocles, Commented Feb 24, 2021 at 9:04

Anurag Dabas · Accepted Answer · 2021-02-24 08:49:41Z

1

If you only want the results for the rows that are in df1, and not the missing keys (4 in this example), just use the dropna() method:

df = (df1.set_index(['Code']) - df2.set_index(['Code'])).dropna()

Then:

df.insert(0,'Date',df1.index)

And finally:

df.reset_index(inplace=True)
df.set_index('Date',inplace=True)

Now if you print df you will get your desired output:

            Code  Val1  Val2
Date
2021-01-01     1 -10.0  -4.0
2021-01-02     1  -9.0   1.0
2021-01-03     1  -8.0   6.0
2021-01-04     2  -2.0   7.0
2021-01-05     2  -1.0  12.0
2021-01-06     2   0.0  17.0
2021-01-07     3  -9.0  20.0
2021-01-08     3  -8.0  25.0
2021-01-09     3  -7.0  30.0
2021-01-10     3  -6.0  35.0

Note: if this is not your desired output, let me know.
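For reference, here is a self-contained sketch of the steps in this answer, run on the question's data. It rests on two assumptions the answer does not state: the subtraction result keeps df1's row order (this holds for the question's data, where df1 is already sorted by Code, but is worth checking on other data), and neither frame has NaN values of its own, since dropna() would remove those rows as well.

```python
import numpy as np
import pandas as pd

# Data from the question.
rng = pd.date_range("2021-01-01", periods=10, freq="D")
df1 = pd.DataFrame(
    index=rng,
    data={
        "Val1": range(10),
        "Val2": np.array(range(10)) * 5,
        "Code": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    },
)
df2 = pd.DataFrame(
    data={"Code": [1, 2, 3, 4], "Val1": [10, 5, 15, 20], "Val2": [4, 8, 10, 7]}
)

# Subtract row-wise by Code. Code 4 exists only in df2, so its row is
# all NaN after alignment and dropna() removes it.
df = (df1.set_index("Code") - df2.set_index("Code")).dropna()

# Assumption: the remaining rows are still in df1's order, so the dates
# can be attached by position. The length check guards the obvious mismatch.
assert len(df) == len(df1)
df.insert(0, "Date", df1.index)

# Turn Code back into a column and make Date the index again.
df = df.reset_index().set_index("Date")
print(df)
```

If the row order cannot be guaranteed, attaching the dates by position like this would silently mislabel rows, so an approach that never leaves df1's index is safer.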


Cameron Riddell · Accepted Answer · 2021-02-24 09:22:50Z

1

You can use reindex to align df2 to df1["Code"]. Then we can take the underlying numpy ndarray and subtract it in place from the corresponding columns of df1. This leaves both the index and the "Code" column untouched and performs the subtraction as expected.

subtract_values = df2.set_index("Code").reindex(df1["Code"]).to_numpy()
df1[["Val1", "Val2"]] -= subtract_values

print(df1)
            Val1  Val2  Code
2021-01-01   -10    -4     1
2021-01-02    -9     1     1
2021-01-03    -8     6     1
2021-01-04    -2     7     2
2021-01-05    -1    12     2
2021-01-06     0    17     2
2021-01-07    -9    20     3
2021-01-08    -8    25     3
2021-01-09    -7    30     3
2021-01-10    -6    35     3

If you don't want to change df1, you can copy the data to a new DataFrame via new_df = df1.copy() and proceed with new_df instead of df1.
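A minimal sketch of that non-mutating variant on the question's data. The name new_df comes from the answer; the rest repeats the setup from the question. One caveat to check on real data: if df1 contained a Code that is missing from df2, reindex would produce NaN for those rows.

```python
import numpy as np
import pandas as pd

# Data from the question.
rng = pd.date_range("2021-01-01", periods=10, freq="D")
df1 = pd.DataFrame(
    index=rng,
    data={
        "Val1": range(10),
        "Val2": np.array(range(10)) * 5,
        "Code": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    },
)
df2 = pd.DataFrame(
    data={"Code": [1, 2, 3, 4], "Val1": [10, 5, 15, 20], "Val2": [4, 8, 10, 7]}
)

# Work on a copy so that df1 keeps its original values.
new_df = df1.copy()

# One row of df2 values per row of new_df, looked up by Code.
# Code 4 never appears in new_df["Code"], so it is simply not selected.
subtract_values = df2.set_index("Code").reindex(new_df["Code"]).to_numpy()

# Plain array subtraction: no index alignment, so the date index is kept.
new_df[["Val1", "Val2"]] -= subtract_values
print(new_df)
```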


1 Comment

Thanasis Over a year ago

I would prefer to be independent of the column names, i.e. not to specify Val1, Val2, Val3, etc.
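One possible way to avoid naming the value columns, sketched under the assumption that every column the two frames share, other than 'Code', should be subtracted. The names lookup, value_cols and result are only illustrative and do not come from either answer.

```python
import numpy as np
import pandas as pd

# Data from the question.
rng = pd.date_range("2021-01-01", periods=10, freq="D")
df1 = pd.DataFrame(
    index=rng,
    data={
        "Val1": range(10),
        "Val2": np.array(range(10)) * 5,
        "Code": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    },
)
df2 = pd.DataFrame(
    data={"Code": [1, 2, 3, 4], "Val1": [10, 5, 15, 20], "Val2": [4, 8, 10, 7]}
)

# Index df2 by the key; its remaining columns are the candidate value columns.
lookup = df2.set_index("Code")

# Assumption: subtract every column present in both frames (the key is
# already excluded because it is now the index of lookup).
value_cols = lookup.columns.intersection(df1.columns)

# Select the columns in the same order on both sides, look up one df2 row
# per df1 row by Code, and subtract as plain arrays to keep df1's index.
result = df1.copy()
result[value_cols] = (
    df1[value_cols] - lookup[value_cols].reindex(df1["Code"]).to_numpy()
)
print(result)
```

Adding a Val3 column to both frames would then need no change to the code, while a column that exists in only one frame is left as it is.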
