I have a dataframe containing a large number of rows (several million). One of the columns holds a string containing a comma-separated list of items (not a Python list, just items separated by commas). The dataframe can be represented as:
df = pd.DataFrame({'A':['a,b,c','b,c,d,e','a,b,e,f','a,c,d,f']})
         A
0    a,b,c
1  b,c,d,e
2  a,b,e,f
3  a,c,d,f
I have a separate Python list containing various elements such as:
lst1 = ['w','x','y','z','b']
I would like to create an additional column in the dataframe that indicates whether any of the elements in lst1 is contained in column A of the dataframe.
My solution has been to join the list elements into a regular expression and use the .str.contains() method to label the rows as either True or False:
regex = '|'.join(['(?:{})'.format(i) for i in lst1])
This produces the following regex:
(?:w)|(?:x)|(?:y)|(?:z)|(?:b)
Then:
df['B'] = df['A'].str.contains(regex)
         A      B
0    a,b,c   True
1  b,c,d,e   True
2  a,b,e,f   True
3  a,c,d,f  False
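One caveat worth noting (a hypothetical concern, since lst1 here contains only plain letters): if the search terms could ever contain regex metacharacters such as . or +, building the pattern with re.escape keeps them literal. A minimal sketch of the same approach with escaping:

```python
import re

import pandas as pd

df = pd.DataFrame({'A': ['a,b,c', 'b,c,d,e', 'a,b,e,f', 'a,c,d,f']})
lst1 = ['w', 'x', 'y', 'z', 'b']

# Escape each term so characters like '.' or '+' are matched literally,
# then join the alternatives into a single pattern.
regex = '|'.join(re.escape(term) for term in lst1)

df['B'] = df['A'].str.contains(regex)
```

With only alphanumeric terms this produces exactly the same result as the unescaped version.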
This works fine on the mini example above but, in a real-world dataframe containing millions of rows, I'm concerned that the use of regular expressions may be too slow to be practical. Is there a faster way to achieve the same outcome?
EDIT
Following an answer by @jezrael, I performed a timing comparison. I generated a dataframe with 4M rows and a list of items to identify as follows:
import pandas as pd

df = pd.DataFrame({'A':['the,cat,sat,on,mat','the,cow,jumped,over,moon','humpty,dumpty,sat,on,the,wall','tiger,burning,bright']*1000000})
terms = ['sat','mat','moon','small','large','home','sliced']
regex = '|'.join(['(?:{})'.format(i) for i in terms])
%timeit df['B'] = df['A'].str.contains(regex)
This produced:
1 loop, best of 3: 8.09 s per loop
Compared with:
import pandas as pd

df = pd.DataFrame({'A':['the,cat,sat,on,mat','the,cow,jumped,over,moon','humpty,dumpty,sat,on,the,wall','tiger,burning,bright']*1000000})
terms = ['sat','mat','moon','small','large','home','sliced']
s = set(terms)
%timeit df['B1'] = [bool(set(x.split(',')) & s) for x in df['A']]
Which produced:
1 loop, best of 3: 8.36 s per loop
So broadly similar results in this particular setup, although, as @jezrael says, the performance of the regex option will be influenced by many factors, such as the length of the strings, the number of matches, and so on.
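It's also worth noting that the two approaches aren't strictly equivalent: .str.contains() matches the terms as substrings anywhere in the string, while the set intersection only matches whole comma-separated items. A small illustration (using a made-up term, 'at', chosen to expose the difference):

```python
import pandas as pd

df = pd.DataFrame({'A': ['the,cat,sat,on,mat']})
terms = ['at']   # 'at' is a substring of 'cat', 'sat' and 'mat',
s = set(terms)   # but not a whole comma-separated item

regex = '|'.join('(?:{})'.format(t) for t in terms)

# Substring match: True, because 'at' occurs inside 'cat' (and others).
df['B'] = df['A'].str.contains(regex)

# Whole-item match: False, because no item equals 'at' exactly.
df['B1'] = [bool(set(x.split(',')) & s) for x in df['A']]
```

For terms that only ever appear as complete items (as in the examples above) the two methods agree, but for partial-word terms the regex would need word boundaries added to behave like the set version.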