Python: String matching on a pandas column of lists

Question

What is the best way to do string matching on a column of lists?
E.g. I have a dataset:

import numpy as np
import pandas as pd
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df

    L                           id
0   [tackle, apple, grapple]    0
1   [tackle, snapple, satchel]  1
2   [satchel, satchel, tackle]  2

I want to return the rows where any item in L contains a given substring, e.g. 'grap' should return row 0, and 'sat' should return rows 1 and 2.

Blehh do you need lists of strings? In a DataFrame? I suppose you could do something like df.L.apply(lambda row: any('whatever' in word for word in row)) but this whole problem feels like one you shouldn't want to have.
– miradulo, Commented Nov 22, 2017 at 18:59
@miradulo what is a better way to store them in this context? Seems accessible and centralized to me but I'm fairly new to data structures.
– rer, Commented Nov 22, 2017 at 19:16
I would mostly question why you're using a DataFrame at this point instead of just a Python dict mapping ids to lists or whatnot. Unless you're getting some benefit from the DataFrame it is just adding some overhead.
– miradulo, Commented Nov 22, 2017 at 19:32

Community · Accepted Answer · 2020-06-20 09:12:55Z

3

Let's use this:

import numpy as np
import pandas as pd

np.random.seed(123)
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
                             L  id
0    [tackle, snapple, tackle]   0
1   [grapple, satchel, tackle]   1
2  [satchel, grapple, grapple]   2

Use any and apply:

df[df.L.apply(lambda x: any('grap' in s for s in x))]

Output:

                             L  id
1   [grapple, satchel, tackle]   1
2  [satchel, grapple, grapple]   2
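
For reuse, the any/apply filter can be wrapped in a small function. This is only a sketch: the name rows_with_substring and its parameters are made up for illustration, and the data is copied from the question's printed example rather than regenerated randomly.

```python
import pandas as pd

def rows_with_substring(df, column, needle):
    """Return the rows of df where any string in the list column contains needle."""
    # One boolean per row: True if at least one list item contains the substring
    mask = df[column].apply(lambda items: any(needle in s for s in items))
    return df[mask]

# Data copied from the question's example output (hypothetical helper above)
df = pd.DataFrame({
    'id': range(3),
    'L': [['tackle', 'apple', 'grapple'],
          ['tackle', 'snapple', 'satchel'],
          ['satchel', 'satchel', 'tackle']],
})

print(rows_with_substring(df, 'L', 'grap')['id'].tolist())  # [0]
print(rows_with_substring(df, 'L', 'sat')['id'].tolist())   # [1, 2]
```

On the question's data this returns row 0 for 'grap' and rows 1 and 2 for 'sat', which is what the question asks for.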

Timings:

%timeit df.L.apply(lambda x: any('grap' in s for s in x))

10000 loops, best of 3: 194 µs per loop

%timeit df.L.apply(lambda i: ','.join(i)).str.contains('grap')

1000 loops, best of 3: 481 µs per loop

%timeit df.L.str.join(', ').str.contains('grap')

1000 loops, best of 3: 529 µs per loop
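
A further variant, not timed here, is to flatten the lists with Series.explode and collapse the per-item matches back to one boolean per row. This is a sketch assuming pandas 0.25 or later (where explode is available), using the question's example data; the variable names are illustrative, and no claim is made about how it compares with the timings above.

```python
import pandas as pd

# Data copied from the question's example output
df = pd.DataFrame({
    'id': range(3),
    'L': [['tackle', 'apple', 'grapple'],
          ['tackle', 'snapple', 'satchel'],
          ['satchel', 'satchel', 'tackle']],
})

# One row per list item; the original row label is repeated in the index
exploded = df['L'].explode()

# Literal substring test per item (na=False covers rows that had empty lists),
# then collapse back to a single boolean per original row label
mask = exploded.str.contains('sat', regex=False, na=False).groupby(level=0).any()

result = df[mask]
print(result['id'].tolist())  # [1, 2]
```

Because each item is tested separately, a pattern cannot accidentally match across two neighbouring list items.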

edited Jun 20, 2020 at 9:12


answered Nov 22, 2017 at 19:03

Scott Boston


ndr · Accepted Answer · 2017-11-22 19:02:00Z

2

df[df.L.apply(lambda i: ','.join(i)).str.contains('yourstring')]

answered Nov 22, 2017 at 19:02

ndr


1 Comment

Zero Over a year ago

df[df.L.str.join(', ').str.contains('grap')] would do?
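
Two caveats with the join-based variants are worth checking against your own data. Series.str.contains treats its pattern as a regular expression by default, and a joined string lets a pattern match across the separator between two items. The sketch below uses the question's example data; the variable names are illustrative.

```python
import pandas as pd

# Data copied from the question's example output
df = pd.DataFrame({
    'id': range(3),
    'L': [['tackle', 'apple', 'grapple'],
          ['tackle', 'snapple', 'satchel'],
          ['satchel', 'satchel', 'tackle']],
})

joined = df['L'].str.join(',')

# Default regex=True: '.' is a wildcard, so 'gr.p' matches 'grapple' in row 0
regex_mask = joined.str.contains('gr.p')

# regex=False: the pattern is taken literally, so no row matches
literal_mask = joined.str.contains('gr.p', regex=False)

# A pattern containing the separator matches across 'apple,grapple' in row 0,
# even though no single list item contains 'e,g'
spanning_mask = joined.str.contains('e,g', regex=False)

print(regex_mask.tolist())     # [True, False, False]
print(literal_mask.tolist())   # [False, False, False]
print(spanning_mask.tolist())  # [True, False, False]
```

If the search string comes from user input, passing regex=False (or escaping it with re.escape) and choosing a separator that cannot occur in the search string avoids both surprises.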
