2

Suppose I have the following dataframe:

import pandas as pd

df = pd.DataFrame({'col1': ['one','one', 'one', 'one', 'two'],
                   'col2': ['two','two','four','four','two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

I know I can subset with:

df[df['col2']=='four']

How do I subset so that it matches a string inside of a list? In this example, how do I select the rows that don't contain 'nodata' in col3?

df[~df['col3'].str.contains('nodata')]

doesn't seem to work, and I can't seem to access the right item inside the list.

1
  • Are you trying to get the row that does contain "nodata" or all rows that do not? You say that you want to get that row, but your example code is negating on the condition, implying that you want the rows that do not contain that. Commented Jan 19, 2016 at 22:31

2 Answers

3

Rather than converting data types, you can use apply with a lambda function, which will be a bit faster.

df[~df.col3.apply(lambda x: 'nodata' in x)]
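A minimal, self-contained sketch of the line above, run on the sample frame from the question. The names `mask` and `result` are only illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['one', 'one', 'one', 'one', 'two'],
                   'col2': ['two', 'two', 'four', 'four', 'two'],
                   'col3': [['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

# True for rows whose list contains the exact element 'nodata'
mask = df['col3'].apply(lambda x: 'nodata' in x)

# Negate the mask to keep only the rows without 'nodata'
result = df[~mask]
print(result)
```

Because `in` is evaluated against each list, this checks for an element equal to 'nodata' rather than for a substring.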

Testing it on a larger dataset:

In [86]: df.shape
Out[86]: (5000, 3)   

My solution:

In [88]: %timeit df[~df.col3.apply(lambda x: 'nodata' in x)]
         1000 loops, best of 3: 1.68 ms per loop

Previous solution:

In [87]: %timeit df[~df['col3'].astype(str).str.contains('nodata')]
         100 loops, best of 3: 7.8 ms per loop

Arguably, the astype(str) approach may be more readable, though.
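To reproduce a comparison like this outside IPython, something along the following lines should work. It is only a sketch: the 5000-row frame `big` is made-up data with a single column (the frame timed above had three), the function names are hypothetical, and the timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

# Hypothetical test frame: 5000 rows, every fifth list contains 'nodata'
rows = [['alpha', 'nodata', 'beta'] if i % 5 == 0 else ['alpha', 'beta']
        for i in range(5000)]
big = pd.DataFrame({'col3': rows})

def with_apply():
    # Membership test on each list
    return big[~big['col3'].apply(lambda x: 'nodata' in x)]

def with_astype():
    # Substring search on the string form of each list
    return big[~big['col3'].astype(str).str.contains('nodata')]

# Both approaches select the same rows on this particular data
assert with_apply().index.equals(with_astype().index)

for fn in (with_apply, with_astype):
    seconds = timeit.timeit(fn, number=100)
    print(f"{fn.__name__}: {seconds / 100 * 1e3:.2f} ms per call")
```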


3 Comments

Agreed, lambdas are quite useful. Might you know a place where I can drill Python lambdas and learn them once and for all?
I would probably start with the Python docs and then maybe google a bit for a tutorial. The key to how it is being used here is that it is a function applied to every cell value in df.col3. Good luck!
It's even faster (and you can lose the lambda) if you use a list comprehension instead: df[['nodata' not in x for x in df.col3]]
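The list-comprehension variant from the last comment, written out as a small sketch. The two-row frame and the name `keep` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'col3': [['alpha', 'beta'],
                            ['alpha', 'nodata', 'beta', 'gamma']]})

# Build a plain Python list of booleans, one per row
keep = ['nodata' not in x for x in df['col3']]

# A boolean list with the same length as the frame works as a row mask
result = df[keep]
print(result)
```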
1

Your code should work if you convert the column's datatype to string:

df[~df['col3'].astype(str).str.contains('nodata')]
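One thing to be aware of with this approach: astype(str) turns each list into text such as "['alpha', 'beta']", so str.contains does a substring search over that text rather than an element-by-element comparison. The sketch below uses made-up data (the 'nodata_v2' value is hypothetical) to show where the two checks can disagree.

```python
import pandas as pd

# Hypothetical data: the second list has 'nodata_v2', not 'nodata'
df = pd.DataFrame({'col3': [['alpha', 'beta'],
                            ['alpha', 'nodata_v2'],
                            ['nodata', 'gamma']]})

# Substring search on the string form also flags the 'nodata_v2' row
by_string = df['col3'].astype(str).str.contains('nodata')

# Membership test flags only lists with the exact element 'nodata'
by_element = df['col3'].apply(lambda x: 'nodata' in x)

print(by_string.tolist())   # [False, True, True]
print(by_element.tolist())  # [False, False, True]
```

If the lists can only ever contain the exact values shown in the question, both checks give the same rows.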
