Compare pandas dataframes by multiple columns

Question

What is the best way to figure out how two dataframes differ based on a combination of multiple columns? Say I have the following:

df1:

  A B C
0 1 2 3
1 3 4 2

df2:

  A B C
0 1 2 3
1 3 5 2

I want to show all rows where there is a difference, such as (3,4,2) vs. (3,5,2) in the example above. I tried pd.merge(), thinking that if I used all columns as the join key with an outer join, I would end up with a dataframe that would help me get what I want, but it doesn't turn out that way.

Thanks to EdChum I was able to use a mask from a boolean diff, as below, but I first had to make sure the indexes were comparable.

df1 = df1.set_index('A')
df2 = df2.set_index('A')  # gives a usable index from one of the key columns
df1 = df1.reindex_like(df2)  # rows that exist only in df2 become NaN in df1
df1[~(df1 == df2).all(axis=1)]  # all rows that differ
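For readers who want to run the workaround above end to end, here is a minimal sketch. The sample data and the names `old`, `new`, `old_aligned` and `changed_keys` are made up for illustration, and it assumes the values in column A are unique so they can serve as an index. Note that `reindex_like` keeps only the keys present in the second frame, so rows that exist only in the first frame are dropped rather than reported.

```python
import pandas as pd

# Hypothetical sample data: key 3 has a changed B value, key 7 is new.
old = pd.DataFrame({"A": [1, 3], "B": [2, 4], "C": [3, 2]})
new = pd.DataFrame({"A": [1, 3, 7], "B": [2, 5, 8], "C": [3, 2, 9]})

# Index both frames by the key column (assumed unique).
old_by_key = old.set_index("A")
new_by_key = new.set_index("A")

# Give the old frame the same index and columns as the new one.
# Keys missing from the old frame are filled with NaN.
old_aligned = old_by_key.reindex_like(new_by_key)

# NaN never compares equal, so newly added keys also show up as differing.
changed = old_aligned[~(old_aligned == new_by_key).all(axis=1)]
changed_keys = changed.index.tolist()
print(changed)
```

Here the mask picks up key 3 (a changed value) and key 7 (present only in the newer frame).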

EdChum · Accepted Answer · 2015-03-16 16:40:49Z

We can use .all with axis=1 to compare row by row, then negate the resulting boolean Series with ~ and use it as a boolean index to show the rows that differ (here df and df1 stand for the question's df1 and df2):

In [43]:

df[~(df==df1).all(axis=1)]
Out[43]:
   A  B  C
1  3  4  2

Breaking this down:

In [44]:

df==df1
Out[44]:
      A      B     C
0  True   True  True
1  True  False  True

In [45]:

(df==df1).all(axis=1)
Out[45]:
0     True
1    False
dtype: bool

We can then invert the above using ~ and pass it to df as a boolean index.
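The steps above can be put together as a self-contained script. This is a sketch that rebuilds the question's two frames under the names this answer uses (`df` and `df1`). The intermediate names `same_cells`, `same_rows` and `differing` are only for illustration. It assumes both frames share the same index and columns, since pandas refuses to compare differently labelled frames with `==`.

```python
import pandas as pd

# The question's two frames, using this answer's names.
df = pd.DataFrame({"A": [1, 3], "B": [2, 4], "C": [3, 2]})
df1 = pd.DataFrame({"A": [1, 3], "B": [2, 5], "C": [3, 2]})

# Element-wise comparison: a frame of booleans with the same shape.
same_cells = df == df1

# A row matches only if every cell in that row matches.
same_rows = same_cells.all(axis=1)

# Negate the mask to keep the rows where at least one cell differs.
differing = df[~same_rows]
print(differing)

# The same mask selects the other frame's version of those rows.
print(df1[~same_rows])
```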

edited Mar 16, 2015 at 16:40

answered Mar 16, 2015 at 16:31

EdChum

6 Comments

horatio1701d Over a year ago

The only thing is that my two dataframes are not identically labelled, so I get "Can only compare identically-labeled DataFrame objects". Is there a quick solution to this? I was thinking of a reindex_like, perhaps?

EdChum Over a year ago

So what exactly will be different: the column names? The number of rows?

horatio1701d Over a year ago

rows would be different. columns are same

EdChum Over a year ago

In what way are the rows different? More rows, fewer rows, or either?

horatio1701d Over a year ago

it could be either fewer or more. Basically I get a new version of a dataset every month and want to be able to get a sense of how the records have shifted or changed in any way.
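For the case described in these comments, where the monthly versions can have more or fewer rows, one possible alternative (not taken from the answer above) is an outer merge on all columns using the `indicator=True` parameter of `DataFrame.merge`. The sketch below uses made-up data and hypothetical names (`old`, `new`, `only_in_old`, `only_in_new`). A row whose values changed appears once on each side, so telling a change apart from a removal plus an addition would need a separate key column.

```python
import pandas as pd

# Hypothetical monthly snapshots: (3, 4, 2) changed to (3, 5, 2),
# (4, 6, 1) was dropped, and (7, 8, 9) is new.
old = pd.DataFrame({"A": [1, 3, 4], "B": [2, 4, 6], "C": [3, 2, 1]})
new = pd.DataFrame({"A": [1, 3, 7], "B": [2, 5, 8], "C": [3, 2, 9]})

# Outer merge on every column; indicator=True adds a "_merge" column
# with the values "left_only", "right_only" or "both".
merged = old.merge(new, how="outer", on=["A", "B", "C"], indicator=True)

# Rows found in only one snapshot are the ones that differ.
only_in_old = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
only_in_new = merged[merged["_merge"] == "right_only"].drop(columns="_merge")

print(only_in_old)
print(only_in_new)
```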
