1

How can I select a subset of a pandas dataframe based on the condition if a column which is a nested list contains a given string.

import pandas as pd

df = pd.DataFrame({'id': [12, 34, 43], 'course': ['Mathematics', 'Sport', 'Biology'], 'students': [['John Doe', 'Peter Parker', 'Lois Lane'], ['Bruce Banner', 'Lois Lane'], ['John Doe', 'Bruce Banner']]})

And now I would like to select all rows in which John Doe is in the students.

3 Answers 3

1
df[df.students.apply(lambda row: "John Doe" in row)]
Sign up to request clarification or add additional context in comments.

Comments

0

Here is a vectorized option:

df[(df['students'].explode() == 'John Doe').groupby(level=0).any()]

2 Comments

This is not performant compared to apply. The explode and followed by groupby consumes time.
@SomeDude Yep - you are correct. Just timed it myself.
0

You can use str methods (first join list to ',' separated values and then look for 'John Doe'):

df[df['students'].str.join(',').str.match('John Doe')]

But actually the apply method can be more performant.

The timeit for a bigger dataframe containing 27 rows(repeated the original df):

%timeit df[df['students'].str.join(',').str.match('John Doe')]
382 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[df.students.apply(lambda row: "John Doe" in row)]
271 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Output:

   id       course                             students
0  12  Mathematics  [John Doe, Peter Parker, Lois Lane]
2  43      Biology             [John Doe, Bruce Banner]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.