What is the best way to do string matching on a column of lists?
For example, I have this dataset:

import numpy as np
import pandas as pd
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df

    L                           id
0   [tackle, apple, grapple]    0
1   [tackle, snapple, satchel]  1
2   [satchel, satchel, tackle]  2

I want to return the rows where any item in L contains a given substring, e.g. 'grap' should return row 0, and 'sat' should return rows 1 and 2.

  • Blehh do you need lists of strings? In a DataFrame? I suppose you could do something like df.L.apply(lambda row: any('whatever' in word for word in row)) but this whole problem feels like one you shouldn't want to have. Commented Nov 22, 2017 at 18:59
  • @miradulo what is a better way to store them in this context? Seems accessible and centralized to me but I'm fairly new to data structures. Commented Nov 22, 2017 at 19:16
  • I would mostly question why you're using a DataFrame at this point instead of just a Python dict mapping ids to lists or whatnot. Unless you're getting some benefit from the DataFrame it is just adding some overhead. Commented Nov 22, 2017 at 19:32
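
The dict alternative suggested in the last comment could look roughly like the sketch below. The names lists_by_id and ids_with_substring are made up for illustration, and the data is copied from the sample output in the question.

```python
# Hypothetical plain-dict layout: id -> list of strings (data from the question's sample)
lists_by_id = {
    0: ['tackle', 'apple', 'grapple'],
    1: ['tackle', 'snapple', 'satchel'],
    2: ['satchel', 'satchel', 'tackle'],
}

def ids_with_substring(mapping, needle):
    """Return the ids whose list has at least one string containing needle."""
    return [key for key, items in mapping.items()
            if any(needle in s for s in items)]

grap_ids = ids_with_substring(lists_by_id, 'grap')
sat_ids = ids_with_substring(lists_by_id, 'sat')
```

This keeps the same any(...) test as the apply-based suggestion, just without the DataFrame around it.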

2 Answers

Let's use this seeded setup so the output is reproducible:

import numpy as np
import pandas as pd
np.random.seed(123)
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
                             L  id
0    [tackle, snapple, tackle]   0
1   [grapple, satchel, tackle]   1
2  [satchel, grapple, grapple]   2

Use any inside apply to build a boolean mask, then index the DataFrame with it:

df[df.L.apply(lambda x: any('grap' in s for s in x))]

Output:

                             L  id
1   [grapple, satchel, tackle]   1
2  [satchel, grapple, grapple]   2
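
As a self-contained sketch of the same idea, the mask can be wrapped in a small helper. The function name rows_with_substring is hypothetical, and the frame below is hard-coded from the sample in the question (not the seeded one above) so the result does not depend on the random generator.

```python
import pandas as pd

# Fixed data taken from the question's sample output (no randomness involved)
df = pd.DataFrame({
    'id': [0, 1, 2],
    'L': [['tackle', 'apple', 'grapple'],
          ['tackle', 'snapple', 'satchel'],
          ['satchel', 'satchel', 'tackle']],
})

def rows_with_substring(frame, column, needle):
    """Keep rows where any string in the list stored in `column` contains needle."""
    mask = frame[column].apply(lambda items: any(needle in s for s in items))
    return frame[mask]

grap_rows = rows_with_substring(df, 'L', 'grap')  # expected: row 0 only
sat_rows = rows_with_substring(df, 'L', 'sat')    # expected: rows 1 and 2
```

The `in` operator does a plain substring test, so characters such as '.' or '*' in the search string are matched literally.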

Timings (on the 3-row frame above):

%timeit df.L.apply(lambda x: any('grap' in s for s in x))

10000 loops, best of 3: 194 µs per loop

%timeit df.L.apply(lambda i: ','.join(i)).str.contains('grap')

1000 loops, best of 3: 481 µs per loop

%timeit df.L.str.join(', ').str.contains('grap')

1000 loops, best of 3: 529 µs per loop
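
One caveat with the two join-based variants timed above: str.contains interprets its argument as a regular expression by default, so a search string with metacharacters can match more than intended. The sketch below uses made-up data (the 'a.pple' entry is purely illustrative) to show the difference when regex=False is passed.

```python
import pandas as pd

# Illustrative data only; 'a.pple' is a contrived value containing a literal dot
df = pd.DataFrame({
    'id': [0, 1],
    'L': [['apple', 'tackle'], ['a.pple', 'satchel']],
})

joined = df['L'].str.join(',')  # one comma-separated string per row

# Default regex=True: '.' matches any character, so 'a.p' also hits 'app' in 'apple'
regex_mask = joined.str.contains('a.p')

# regex=False: the search string is treated as a literal substring
literal_mask = joined.str.contains('a.p', regex=False)
```

Here regex_mask selects both rows, while literal_mask selects only the row that really contains 'a.p'. The any(...)-based version always behaves like the literal case.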

df[df.L.apply(lambda i: ','.join(i)).str.contains('yourstring')]

1 Comment

Would df[df.L.str.join(', ').str.contains('grap')] also work?
