Subsetting pandas dataframe with list in cell

Question

Suppose I have the following dataframe

df = pd.DataFrame({'col1': ['one','one', 'one', 'one', 'two'],
                   'col2': ['two','two','four','four','two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

I know I can subset with:

df[df['col2']=='four']

How do I subset so that it matches a string INSIDE a list? In this example, how do I select the rows that don't contain 'nodata' in col3?

df[~df['col3'].str.contains('nodata')]

doesn't seem to work, and I can't seem to access the 'right' item inside the list.

Are you trying to get the row that does contain "nodata" or all rows that do not? You say that you want to get that row, but your example code negates the condition, implying that you want the rows that do not contain it.
– Matthew, Commented Jan 19, 2016 at 22:31

johnchase · Accepted Answer · 2016-01-19 22:39:33Z

3

Rather than converting the column to strings, you can use apply with a lambda function, which is a bit faster.

df[~df.col3.apply(lambda x: 'nodata' in x)]
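For reference, here is a self-contained sketch of that line run against the example frame from the question. The names `mask` and `result` are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['one', 'one', 'one', 'one', 'two'],
                   'col2': ['two', 'two', 'four', 'four', 'two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

# True for each row whose list in col3 contains the element 'nodata'
mask = df['col3'].apply(lambda x: 'nodata' in x)

# Negate the boolean mask to keep only the rows without 'nodata'
result = df[~mask]
print(result)
```

With the example data, only the last row has 'nodata' in its list, so the first four rows are kept.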

Testing it on a larger dataset:

In [86]: df.shape
Out[86]: (5000, 3)

My solution:

In [88]: %timeit df[~df.col3.apply(lambda x: 'nodata' in x)]
         1000 loops, best of 3: 1.68 ms per loop

Previous solution:

In [87]: %timeit df[~df['col3'].astype(str).str.contains('nodata')]
         100 loops, best of 3: 7.8 ms per loop

Arguably the astype(str) answer may be more readable, though.


3 Comments

eatkimchi Over a year ago

agreed. lambdas are quite useful. might you know a place where i can drill python lambdas and learn them once and for all?

johnchase Over a year ago

I would probably start with the Python docs and then maybe google a bit for a tutorial. The key to how it is being used here is that it is a function that is applied to every cell value in df.col3. Good luck!
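To make the "function applied to every cell" point concrete, here is a minimal sketch showing that the lambda is interchangeable with an ordinary named function. The function name `has_nodata` and the two-row Series are hypothetical and chosen only for illustration.

```python
import pandas as pd

# A small Series of lists, standing in for df.col3
col3 = pd.Series([['alpha', 'beta'],
                  ['alpha', 'nodata', 'beta', 'gamma']])

def has_nodata(cell):
    # `cell` is one value from the Series, here a Python list
    return 'nodata' in cell

# apply calls the function once per cell and collects the results
by_function = col3.apply(has_nodata)

# The lambda form does exactly the same thing without naming the function
by_lambda = col3.apply(lambda x: 'nodata' in x)

print(by_function.equals(by_lambda))
```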

maxymoo Over a year ago

it's even faster (and you can lose the lambda) if you use a list comprehension instead: df[['nodata' not in x for x in df.col3]]
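A runnable sketch of the list-comprehension variant from this comment, using the example frame from the question. No timing is claimed here; the names `mask` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['one', 'one', 'one', 'one', 'two'],
                   'col2': ['two', 'two', 'four', 'four', 'two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

# Build a plain Python list of booleans, one per row;
# `not in` means no separate negation step is needed
mask = ['nodata' not in x for x in df['col3']]

# A list of booleans with one entry per row works as a row filter
result = df[mask]
print(result)
```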

maxymoo · Accepted Answer · 2016-01-19 22:21:49Z

1

Your code should work if you convert the column's datatype to string:

df[~df['col3'].astype(str).str.contains('nodata')]
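One thing to be aware of with this approach: after astype(str) each list becomes its text representation, so str.contains matches substrings rather than whole list elements. The sketch below uses a made-up element, 'nodata_v2', to show where this differs from an element-wise membership test.

```python
import pandas as pd

# Hypothetical data: 'nodata_v2' is not the same element as 'nodata'
s = pd.Series([['alpha', 'beta'],
               ['alpha', 'nodata_v2'],
               ['nodata']])

# Each list is converted to text such as "['alpha', 'beta']"
as_text = s.astype(str)

# Substring search on the text: also flags the 'nodata_v2' row
substring_hit = as_text.str.contains('nodata')

# Membership test on the actual list: flags only an exact 'nodata' element
exact_hit = s.apply(lambda x: 'nodata' in x)

print(list(substring_hit))
print(list(exact_hit))
```

For the data in the question the two approaches agree; they diverge only when another element happens to contain 'nodata' as a substring.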

