5

I have a two dataframes. I want to see if specific rows (in its entirety) exist in the other dataframe. Example rows from df_subset:

id    category    value    date
1     A           10       01-01-15
3     C           10       03-01-15

The other df_full:

id    category    value    date
1     A           10       01-01-15
2     B           10       02-01-15
3     C           10       03-01-15
4     D           16       04-01-15

Is there someway to check if the rows of one dataframe exists in another? Something like this (obviously this does not work): df_subset in df_full, exist?

> True

3 Answers 3

4

I think you can use merge with inner join (by default) with DataFrame.equals for compare with df_subset:

print (pd.merge(df_subset,df).equals(df_subset))
True
Sign up to request clarification or add additional context in comments.

Comments

3

you can use merge(..., indicator=True) method:

In [14]: pd.merge(df1, df2, indicator=True, how='outer')
Out[14]:
   id category  value      date      _merge
0   1        A     10  01-01-15        both
1   3        C     10  03-01-15        both
2   2        B     10  02-01-15  right_only
3   4        D     16  04-01-15  right_only

Comments

3

using numpy

(df_subset.values[:, None] == df_full.values).all(2).any(1).all()

True

timing
enter image description here

explanation

# using [:, None] to extend into new dimension at
# take advantage of broadcasting
a1 = df_subset.values[:, None] == df_full.values

     # ━> third dimension ━>
     # ━━━━> axis=2 ━━━>
# 1st dim
---->[[[ True  True  True  True]   # │
       [False False  True False]   # │ second dimension
       [False False  True False]   # │ axis=1
       [False False False False]]  # ↓

 # axis=0
 ---->[[False False  True False]   # │ 
       [False False  True False]   # │ second dimension
       [ True  True  True  True]   # │ axis=1
       [False False False False]]] # ↓

# first row of subset with each row of full 
[[[ True  True  True  True]  <-- This one is true for all 
  [False False  True False]
  [False False  True False]
  [False False False False]]

# second row of subset with each row of full 
 [[False False  True False] 
  [False False  True False]
  [ True  True  True  True]  <-- This one is true for all
  [False False False False]]]

a2 = a1.all(2)

#   ┌─ first row of subset all equal
[[ True False False False]
 [False False  True False]]
#               └─ second row of subset all equal

a3 = a2.any(1)

#  ┌─ first row of subset matched at least one row of full
[ True  True]
#        └─ second row of subset matched at least one row of full

a3.all()

True

ALL ROWS OF df_subset ARE IN df_full

2 Comments

I extended array to third dimension. all(2) is along third axis. I'll update post to make it clearer
Ya, it is really complicated. Numpy is for me complicated, I prefer pandas.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.