I have a dataframe containing a large number of rows (several million). One of the columns contains a string holding a comma-separated list (not a Python list, just items separated by commas). The dataframe can be represented as:

import pandas as pd

df = pd.DataFrame({'A':['a,b,c','b,c,d,e','a,b,e,f','a,c,d,f']})

         A
0    a,b,c
1  b,c,d,e
2  a,b,e,f
3  a,c,d,f

I have a separate Python list containing various elements such as:

lst1 = ['w','x','y','z','b']

I would like to create an additional column in the dataframe that indicates whether any of the elements in lst1 is contained in column A.

My solution has been to convert the list elements to a regular expression and to use the .str.contains() method to label the rows as either True or False:

regex = '|'.join(['(?:{})'.format(i) for i in lst1])

This produces the following regex:

(?:w)|(?:x)|(?:y)|(?:z)|(?:b)

Then:

df['B'] = df['A'].str.contains(regex)

         A      B
0    a,b,c   True
1  b,c,d,e   True
2  a,b,e,f   True
3  a,c,d,f  False
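
One caveat worth noting (my addition, not part of the original question): .str.contains() does substring matching, so with multi-character items a term like 'sat' would also match 'satellite'. If exact token matches are required, a sketch along these lines should work, escaping each term and anchoring it to a comma or a string boundary:

import re

# Match a term only when it is a whole comma-delimited token,
# i.e. bounded by a comma or the start/end of the string.
token_regex = r'(?:^|,)(?:{})(?:,|$)'.format('|'.join(map(re.escape, lst1)))
df['B'] = df['A'].str.contains(token_regex)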

This works fine on the mini example above but, with a real-world dataframe containing millions of rows, I'm concerned that the use of regular expressions may be too slow to be practical. Is there a faster way to achieve the same outcome?

EDIT

Following an answer by @jezrael, I performed a timing comparison. I generated a dataframe with 4M rows and a list of items to identify as follows:

import pandas as pd

df = pd.DataFrame({'A':['the,cat,sat,on,mat','the,cow,jumped,over,moon','humpty,dumpty,sat,on,the,wall','tiger,burning,bright']*1000000})

terms = ['sat','mat','moon','small','large','home','sliced']
regex = '|'.join(['(?:{})'.format(i) for i in terms])

%timeit df['B'] = df['A'].str.contains(regex)

This produced:

1 loop, best of 3: 8.09 s per loop

Compared with:

import pandas as pd

df = pd.DataFrame({'A':['the,cat,sat,on,mat','the,cow,jumped,over,moon','humpty,dumpty,sat,on,the,wall','tiger,burning,bright']*1000000})

terms = ['sat','mat','moon','small','large','home','sliced']
s = set(terms)

%timeit df['B1'] = [bool(set(x.split(',')) & s) for x in df['A']]

Which produced:

1 loop, best of 3: 8.36 s per loop

So the results were broadly similar in this particular setup, although, as @jezrael says, the performance of the regex option will depend on many factors such as string length, number of matches, and so on.
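
For anyone reproducing this outside IPython, where the %timeit magic isn't available, here is a standalone sketch using the timeit module (variable names such as t_regex and t_sets are mine; absolute timings will obviously vary by machine):

import timeit
import pandas as pd

df = pd.DataFrame({'A':['the,cat,sat,on,mat','the,cow,jumped,over,moon','humpty,dumpty,sat,on,the,wall','tiger,burning,bright']*1000000})
terms = ['sat','mat','moon','small','large','home','sliced']
regex = '|'.join('(?:{})'.format(t) for t in terms)
s = set(terms)

# Run each candidate once over the 4M-row frame.
t_regex = timeit.timeit(lambda: df['A'].str.contains(regex), number=1)
t_sets = timeit.timeit(lambda: [bool(set(x.split(',')) & s) for x in df['A']], number=1)
print('regex: {:.2f} s, sets: {:.2f} s'.format(t_regex, t_sets))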

1 Answer

One non-regex solution is to use the intersection of sets and convert the result to bool:

s = set(lst1)
df['B1'] = [bool(set(x.split(',')) & s) for x in df['A']]
print(df)
         A      B     B1
0    a,b,c   True   True
1  b,c,d,e   True   True
2  a,b,e,f   True   True
3  a,c,d,f  False  False
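
A possible micro-optimization (my suggestion, not benchmarked here) is set.isdisjoint, which short-circuits at the first shared element instead of building the full intersection set:

# True as soon as any item in the row overlaps s; no intermediate set is built.
df['B1'] = [not s.isdisjoint(x.split(',')) for x in df['A']]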

3 Comments

That's a nice alternative. Thanks for the suggestion. I'll have to set up a comparison to look at timings when used with large dataframes.
@user1718097 - Yes, it's best to test with large data, but str.contains could possibly be faster (it also depends on the number of items in the list and the number of matched values)
Edited original to add some time comparisons. In fact, both methods performed similarly in the rather limited setup that I tested.
