I have a data frame that I'd like to filter by a column that is of type array. What is the most effective way to do this?

import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5], 'b': [['true','false'],['false'],['false','false','false'],['false','false','true'],[]]})
df

    a   b
0   1   [true, false]
1   2   [false]
2   3   [false, false, false]
3   4   [false, false, true]
4   5   []

I'd ideally like to only return rows that contain a true value.

  • array is not a dtype. There is no really effective way to work with lists in a pandas.DataFrame, but you could always do something like df[df.b.apply(lambda x: 'true' in x)] Commented Jan 30, 2018 at 1:14
  • @juanpa.arrivillaga would using any() be more performant? Commented Jan 30, 2018 at 1:19
  • @pault well, if they are actually numpy.ndarray objects instead of list objects, maybe slightly, but the time sink is iterating over the rows, which is necessitated in this case. Furthermore, those arrays would be dtype=object anyway, so iteration would still be slow. Commented Jan 30, 2018 at 1:21
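
The row-wise approach suggested in the comments can be sketched as follows. This is a minimal example using the question's data; the names `mask` and `filtered` are illustrative and not from the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [['true', 'false'], ['false'], ['false', 'false', 'false'],
          ['false', 'false', 'true'], []],
})

# Test each list for membership; this still iterates over the rows in Python.
mask = df['b'].apply(lambda values: 'true' in values)

# Keep only the rows whose list contains the string 'true'.
filtered = df[mask]
print(filtered)
```

An empty list simply yields False, so the last row is dropped without any special handling.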

1 Answer

Without loop :-)

df[pd.DataFrame(df.b.tolist()).eq('true').any(axis=1)]
Out[98]: 
   a                     b
0  1         [true, false]
3  4  [false, false, true]
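
For reference, here is a self-contained sketch of the expression above, plus one alternative that is not part of the original answer: `Series.explode`, which assumes pandas 0.25 or later. The variable names are illustrative, and passing `index=df.index` is an added precaution so the mask aligns when the frame does not use the default index.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [['true', 'false'], ['false'], ['false', 'false', 'false'],
          ['false', 'false', 'true'], []],
})

# Approach from the answer: expand the lists into a padded frame
# (shorter lists are filled with missing values), compare, reduce per row.
expanded = pd.DataFrame(df['b'].tolist(), index=df.index)
mask_expanded = expanded.eq('true').any(axis=1)

# Alternative (not from the answer): one row per list element,
# then collapse back to one boolean per original row label.
mask_exploded = df['b'].explode().eq('true').groupby(level=0).any()

result_expanded = df[mask_expanded]
result_exploded = df[mask_exploded]
print(result_expanded)
```

Both masks select the same rows here. The grouping in the `explode` variant relies on the index labels being unique.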

1 Comment

Wonderful answer... very impressive.
