Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe

Question

This question is based on another question I asked, where I didn't cover the problem entirely: Pandas - check if a string column contains a pair of strings

This is a modified version of the question.

I have two dataframes:

df1 = pd.DataFrame({'consumption':['squirrel ate apple', 'monkey likes apple', 
                                  'monkey banana gets', 'badger gets banana', 'giraffe eats grass', 'badger apple loves', 'elephant is huge', 'elephant eats banana tree', 'squirrel digs in grass']})

df2 = pd.DataFrame({'food':['apple', 'apple', 'banana', 'banana'], 
                   'creature':['squirrel', 'badger', 'monkey', 'elephant']})

The goal is to test, for each row of df1.consumption, whether any df2.creature/df2.food pair is present in it.

The expected answer for this test in the above example would be:

['True', 'False', 'True', 'False', 'False', 'True', 'False', 'True', 'False']

The pattern is:

squirrel ate apple = True, since squirrel and apple are a pair. monkey likes apple = False, since monkey and apple are not a pair we are looking for.

I was thinking of constructing a dictionary of dataframes of the pair values, with one dataframe per creature (e.g. squirrel, monkey), and then using np.where to build a boolean expression and perform a str.contains.

I am not sure that is the easiest way.

MaxU - stand with Ukraine · Accepted Answer · 2017-04-17 00:20:20Z


Consider this vectorized approach:

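import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()

# one row per sentence (X) or per creature/food pair (Y), one column per word
X = vect.fit_transform(df1.consumption)
Y = vect.transform(df2.creature + ' ' + df2.food)

# X.dot(Y.T) counts shared words; more than 1 means both words of a pair matched
res = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))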

Result:

In [67]: res
Out[67]: array([ True, False,  True, False, False,  True, False,  True, False], dtype=bool)

Explanation:

In [68]: pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
Out[68]:
   apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
0      1    1       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
1      1    0       0       0     0     0         0     0        0      0     0   0   0      1      0       1         0     0
2      0    0       0       1     0     0         0     1        0      0     0   0   0      0      0       1         0     0
3      0    0       1       1     0     0         0     1        0      0     0   0   0      0      0       0         0     0
4      0    0       0       0     0     1         0     0        1      1     0   0   0      0      0       0         0     0
5      1    0       1       0     0     0         0     0        0      0     0   0   0      0      1       0         0     0
6      0    0       0       0     0     0         1     0        0      0     1   0   1      0      0       0         0     0
7      0    0       0       1     0     1         1     0        0      0     0   0   0      0      0       0         0     1
8      0    0       0       0     1     0         0     0        0      1     0   1   0      0      0       0         1     0

In [69]: pd.DataFrame(Y.toarray(), columns=vect.get_feature_names_out())
Out[69]:
   apple  ate  badger  banana  digs  eats  elephant  gets  giraffe  grass  huge  in  is  likes  loves  monkey  squirrel  tree
0      1    0       0       0     0     0         0     0        0      0     0   0   0      0      0       0         1     0
1      1    0       1       0     0     0         0     0        0      0     0   0   0      0      0       0         0     0
2      0    0       0       1     0     0         0     0        0      0     0   0   0      0      0       1         0     0
3      0    0       0       1     0     0         1     0        0      0     0   0   0      0      0       0         0     0
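The two tables above can be read as binary word-presence matrices, and the dot product between them counts how many words each sentence shares with each pair. The following is a minimal sketch of that idea using plain NumPy instead of CountVectorizer; the helper `bag` and the shortened data are assumptions made for illustration, not part of the original answer.

```python
import numpy as np

sentences = ["squirrel ate apple", "monkey likes apple", "badger gets banana"]
pairs = ["squirrel apple", "badger apple", "monkey banana"]

# Vocabulary built from the sentences only, mirroring fit_transform on df1.
vocab = sorted({word for s in sentences for word in s.split()})

def bag(text):
    """Return a 0/1 vector marking which vocabulary words occur in text."""
    words = set(text.split())
    return [int(word in words) for word in vocab]

X = np.array([bag(s) for s in sentences])  # sentences x vocabulary
Y = np.array([bag(p) for p in pairs])      # pairs x vocabulary

# shared[i, j] = number of vocabulary words sentence i shares with pair j
shared = X @ Y.T

# A value of 2 means both the creature and the food were found.
match = (shared > 1).any(axis=1)
print(shared)
print(match)
```

Because this sketch uses 0/1 entries, a repeated word cannot push a count above 1 on its own. CountVectorizer stores word counts by default, so a sentence that repeats one word of a pair could in principle exceed the threshold; passing `binary=True` to CountVectorizer is one way to avoid that.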

UPDATE:

In [92]: df1['match'] = np.ravel(np.any((X.dot(Y.T) > 1).todense(), axis=1))

In [93]: df1
Out[93]:
                 consumption  match
0         squirrel ate apple   True
1         monkey likes apple  False
2         monkey banana gets   True
3         badger gets banana  False
4         giraffe eats grass  False
5         badger apple loves   True
6           elephant is huge  False
7  elephant eats banana tree   True
8     squirrel digs in grass  False
9        squirrel.eats/apple   True   # <----- NOTE
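The row marked NOTE matches even though it contains no spaces, which is consistent with CountVectorizer splitting on punctuation rather than on whitespace only. The sketch below imitates that behaviour with the standard library, assuming the vectorizer's default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`); the helpers `tokens` and `has_pair` are hypothetical names used only for illustration.

```python
import re

# Assumed to mirror CountVectorizer's default token pattern:
# runs of two or more word characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokens(text):
    """Split text into lowercase word tokens, ignoring punctuation."""
    return set(TOKEN.findall(text.lower()))

def has_pair(text, pairs):
    """True if both words of at least one (creature, food) pair occur in text."""
    words = tokens(text)
    return any({creature, food} <= words for creature, food in pairs)

pairs = [("squirrel", "apple"), ("badger", "apple"),
         ("monkey", "banana"), ("elephant", "banana")]

url_like = has_pair("squirrel.eats/apple", pairs)
no_match = has_pair("monkey.likes/apple", pairs)
print(url_like, no_match)
```

Under these assumptions, URL-like strings are tokenized the same way as space-separated sentences, so no preprocessing of the consumption column should be needed for this case.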

edited Apr 17, 2017 at 0:20

answered Apr 16, 2017 at 23:53

MaxU - stand with Ukraine


8 Comments

vagabond Over a year ago

Thanks. There's one caveat: there's no fixed order in which creature and food occur, so this: Y = vect.transform(df2.creature + ' ' + df2.food) won't work. Sorry, I just modified the consumption values in the question to reflect that.

MaxU - stand with Ukraine Over a year ago

@vagabond, did you test my solution against your modified data sets? ;-)

vagabond Over a year ago

Whoa, it works! Is there any way to extract the creature and food from the rows they matched in? Also, my lookup data has 100K+ rows. Could the sparse matrix blow up memory?

MaxU - stand with Ukraine Over a year ago

@vagabond, sparse matrices are extremely memory-saving (that's their main purpose). About the extraction: could you open a new question and include an example there?

vagabond Over a year ago

Hmm, yes I could. The other problem is that sometimes my text has no spaces, e.g. squirrel.eats/apple; it's URL data.

piRSquared · Accepted Answer · 2017-04-17 02:15:14Z

This is my answer using comprehensions and zip.
Note that this checks for substrings in df1.

import numpy as np

c = df1.consumption.tolist()
f = df2.food.tolist()
a = df2.creature.tolist()

# check[i, j] is True when sentence i contains both strings of pair j
check = np.array([[fd in cs and cr in cs for fd, cr in zip(f, a)] for cs in c])

check.any(axis=1)

array([ True, False,  True, False, False,  True, False,  True, False], dtype=bool)
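Since the comprehension above tests substrings, it can behave differently from a whole-word comparison when one word is contained in another. The following sketch contrasts the two; the function names and the sentence "squirrel ate pineapple" are made up for illustration and do not appear in the question's data.

```python
def substring_pair(text, creature, food):
    """Substring test, as in the comprehension above."""
    return creature in text and food in text

def token_pair(text, creature, food):
    """Whole-word test on whitespace-separated tokens."""
    words = set(text.split())
    return creature in words and food in words

text = "squirrel ate pineapple"  # hypothetical sentence

substring_hit = substring_pair(text, "squirrel", "apple")
token_hit = token_pair(text, "squirrel", "apple")
print(substring_hit, token_hit)
```

Which behaviour is preferable depends on the data: substring matching tolerates text without separators, while whole-word matching avoids hits such as "apple" inside "pineapple".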

This is a pandas version of what @MaxU did. Respect what he did... it is awesome!

X = df1.consumption.str.get_dummies(' ')
Y = (df2.creature + ' ' + df2.food).str.get_dummies(' ') \
    .reindex(columns=X.columns, fill_value=0)

# This is where you can see which rows from `df2` (columns)
# matched with which rows from `df1` (rows) 
XY = X.dot(Y.T)

print(XY)

   0  1  2  3
0  2  1  0  0
1  1  1  1  0
2  0  0  2  1
3  0  1  1  1
4  0  0  0  0
5  1  2  0  0
6  0  0  0  1
7  0  0  1  2
8  1  0  0  0

# return the desired `True`s and `False`s

XY.gt(1).any(axis=1)

0     True
1    False
2     True
3    False
4    False
5     True
6    False
7     True
8    False
dtype: bool
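One of the comments asks how to recover which creature and food matched in each row. Because the product's rows correspond to df1 and its columns to df2, the positions of values greater than 1 identify the matching pairs. The sketch below is one possible way to do this, assuming a recent pandas and a reduced data set; the names `hits` and `result` are illustrative and not from the original answers.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'consumption': ['squirrel ate apple',
                                    'monkey likes apple',
                                    'monkey banana gets']})
df2 = pd.DataFrame({'food': ['apple', 'banana'],
                    'creature': ['squirrel', 'monkey']})

X = df1.consumption.str.get_dummies(' ')
Y = ((df2.creature + ' ' + df2.food).str.get_dummies(' ')
     .reindex(columns=X.columns, fill_value=0))

# hits[i, j] is True when row i of df1 contains both words of row j of df2
hits = X.dot(Y.T).gt(1)

# Positions of the True cells, in row-major order
rows, cols = np.nonzero(hits.to_numpy())

result = pd.DataFrame({
    'consumption': df1.consumption.to_numpy()[rows],
    'creature': df2.creature.to_numpy()[cols],
    'food': df2.food.to_numpy()[cols],
})
print(result)
```

A row of df1 that matches several pairs would appear once per matching pair in `result`, and rows without any match are omitted.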

naive testing
