3

I am trying to filter a dataframe in which some columns contain lists, and I want to filter out the list elements that do not pass the condition.

For example:

import pandas as pd 
df = pd.DataFrame({'col1':[10,20], 'col2': [[1,2,3],[3,4,5]], 'col3': [[False,False,True],[True,True,False]],'col4':[True,False]})
   col1       col2                  col3   col4
0    10  [1, 2, 3]  [False, False, True]   True
1    20  [3, 4, 5]   [True, True, False]  False

Applying the filter:

df_filtered = df.query("col2>2 & col3==True")

The output I expect:

[image: expected output]

Thanks for the help!

5
  • Maybe you want to transform your data as in this question then query. Commented Feb 8, 2021 at 15:27
  • It looks like he is trying to use the boolean lists in col3 as a filter against the lists in col2. Col4 seems irrelevant Commented Feb 8, 2021 at 15:29
  • @GiorgosMyrianthous because it does not satisfy condition on col3 Commented Feb 8, 2021 at 15:29
  • @QuangHoang if you mean using explode(), I have tried it but it is very slow and ended up blowing up the size of the dataframe. I am working on very large dataset unfortunately. Commented Feb 8, 2021 at 15:31
  • @Ben.T yes they are Commented Feb 8, 2021 at 15:34

4 Answers

4

Try:

df[['col2','col3']] = (pd.DataFrame({'col2': df['col2'].explode(),
                                     'col3': df['col3'].explode()})
                         .query('col2>2 & col3==True')
                         .groupby(level=0).agg(list)
                      )

Output:

print(df)

   col1    col2          col3   col4
0    10     [3]        [True]   True
1    20  [3, 4]  [True, True]  False
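
One case the sample data does not exercise: if no element in a row passes the filter, that row is absent from the groupby result, so assigning it back leaves NaN in that row. The following is a minimal sketch, assuming empty lists are the desired result for such rows. The third row of the frame and the explicit mask (used here in place of `query`) are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Third row is an added example: no element satisfies col2 > 2
df = pd.DataFrame({
    'col1': [10, 20, 30],
    'col2': [[1, 2, 3], [3, 4, 5], [1, 1, 1]],
    'col3': [[False, False, True], [True, True, False], [True, True, True]],
})

exploded = pd.DataFrame({'col2': df['col2'].explode(),
                         'col3': df['col3'].explode()})

# explode() yields object dtype, so cast before comparing
mask = (exploded['col2'].astype(int) > 2) & exploded['col3'].astype(bool)
kept = exploded[mask].groupby(level=0).agg(list)

# Rows with no surviving element are missing from `kept`; reindex brings them back as NaN
kept = kept.reindex(df.index)

# Replace the NaN placeholders with empty lists
for col in ['col2', 'col3']:
    df[col] = [v if isinstance(v, list) else [] for v in kept[col]]

print(df)
```

Whether an empty list or NaN is the better marker for "nothing passed" depends on what happens downstream; the last loop is the only part that encodes that choice.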

2 Comments

thank you Quang! as you said, it is not as bad performance-wise as I thought!
The version from Ben. T. is 10x faster than this...
2

You can use numpy and an iterative approach if memory is the main constraint.

This modifies the dataframe in place without having to create a large interim data structure in the process:

import pandas as pd
import numpy as np

df = pd.DataFrame({'col1':[10,20], 'col2': [[1,2,3],[3,4,5]], 'col3': [[False,False,True],[True,True,False]]})

for idx, row in df.iterrows():
    a1 = np.array(row['col2'])
    a2 = np.array(row['col3'])
    # one shared mask so col2 and col3 keep the same positions
    mask = (a1 > 2) & a2
    df.at[idx, 'col2'] = a1[mask]
    df.at[idx, 'col3'] = a2[mask]

>>> df
   col1    col2          col3
0    10     [3]        [True]
1    20  [3, 4]  [True, True]


1

As the lists are the same size across rows, you can probably use arrays and a mask like this:

import numpy as np

arr2 = np.array(df['col2'].tolist())
arr3 = np.array(df['col3'].tolist())

# the question's condition is col2 > 2 (strictly greater) and col3 == True
df[['col2','col3']] = [[c2[b],c3[b]] for c2,c3,b in zip(arr2,arr3,(arr2>2) & arr3)]

print(df)
   col1    col2          col3   col4
0    10     [3]        [True]   True
1    20  [3, 4]  [True, True]  False
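
The array approach depends on every list having the same length, since otherwise `np.array` cannot build a regular 2-D array. Here is a hedged sketch of the same idea with that assumption checked first. The `same_length` check, the `mask` name, and the conversion back to plain lists are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [10, 20],
                   'col2': [[1, 2, 3], [3, 4, 5]],
                   'col3': [[False, False, True], [True, True, False]],
                   'col4': [True, False]})

# Verify the equal-length assumption before building 2-D arrays
len2 = df['col2'].map(len)
len3 = df['col3'].map(len)
same_length = len2.nunique() == 1 and (len2 == len3).all()
if not same_length:
    raise ValueError("lists differ in length; the 2-D array approach does not apply")

arr2 = np.array(df['col2'].tolist())              # shape (n_rows, list_len)
arr3 = np.array(df['col3'].tolist(), dtype=bool)  # same shape, boolean
mask = (arr2 > 2) & arr3                          # computed once for the whole frame

# Apply the row's mask to both columns and store plain Python lists
df['col2'] = [row[m].tolist() for row, m in zip(arr2, mask)]
df['col3'] = [row[m].tolist() for row, m in zip(arr3, mask)]

print(df)
```

Assigning one column at a time avoids relying on how pandas interprets a nested list of ragged arrays.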

1 Comment

This is 10x faster than the version with .explode. I timed it...
0

Another way with loops, but probably slower. Note that it filters on col3 only:

for index in df.index:
    values = df.at[index, 'col2']
    j = 0
    for flag in df.at[index, 'col3']:
        if not flag:
            # delete by position; list.remove() would drop the first equal value instead
            del values[j]
        else:
            j += 1
    df.at[index, 'col3'] = list(filter(None, df.at[index, 'col3']))
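
A pure-Python variant that applies both conditions and builds new lists rather than mutating them during iteration might look like the sketch below. `filter_pair` and its `threshold` parameter are hypothetical names introduced for illustration. This version makes no claim about speed relative to the other answers.

```python
import pandas as pd

def filter_pair(values, flags, threshold=2):
    """Keep positions where the value exceeds `threshold` and the flag is truthy."""
    kept = [(v, f) for v, f in zip(values, flags) if v > threshold and f]
    return [v for v, _ in kept], [f for _, f in kept]

df = pd.DataFrame({'col1': [10, 20],
                   'col2': [[1, 2, 3], [3, 4, 5]],
                   'col3': [[False, False, True], [True, True, False]],
                   'col4': [True, False]})

# Filter each row's pair of lists, then write both columns back
pairs = [filter_pair(v, f) for v, f in zip(df['col2'], df['col3'])]
df['col2'] = [p[0] for p in pairs]
df['col3'] = [p[1] for p in pairs]

print(df)
```

Because it never builds a 2-D array, this also works when the lists differ in length from row to row.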

