
Let's say I have a pandas dataframe df with columns A, B, C, D, E, F, G, H, and I want to filter it using a function functn that takes a "row" and returns True or False depending on whether the row fulfills certain conditions (say the function uses every column except H). Is there a way to filter this dataframe efficiently without a long and ugly lambda? The solution I have so far looks like this:

df = df[df.apply(functn, axis=1)]

but this method is VERY slow, even for a frame with only 15k rows. Is there a clean and efficient way to filter a pandas dataframe using a user-defined Python function instead of a lambda or query?

Note: I previously implemented this using plain Python 2D lists and it was MUCH faster than pandas. Am I misusing a feature, or is there a way to make this filtering faster that I'm not aware of?

edit:

The data is structured roughly like this:

#       A       B       C     D     E     F      G        H      
[
    [string1, string2, int1, int2, int3, int4, float1, float2], 
    ...
]

The function does something like this:

def filter(row):
    var1 = row.G <= 0.01
    partial_a = (((row.D - row.C + 1)*1.0)/global_map[row.A])
    partial_b = (((row.F - row.E + 1)*1.0)/global_map[row.B])
    partial = partial_a >= 0.66 or partial_b >= 0.66
    return var1 and partial

The non-pandas implementation treated the data as a plain 2D list, looped through each row, applied the function to it (with a list as the argument instead of a "row"), and appended the row to a new list if the function returned True.
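
Roughly, it looked like this (a sketch; rows and filter_row are placeholder names, and the column indices follow the A–H layout above):

def filter_row(row):
    # row is a plain list: [A, B, C, D, E, F, G, H]
    var1 = row[6] <= 0.01                                         # G
    partial_a = (row[3] - row[2] + 1) * 1.0 / global_map[row[0]]  # D, C, A
    partial_b = (row[5] - row[4] + 1) * 1.0 / global_map[row[1]]  # F, E, B
    return var1 and (partial_a >= 0.66 or partial_b >= 0.66)

filtered = []
for row in rows:
    if filter_row(row):
        filtered.append(row)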

  • It'll be easier to help you if you can provide (a) example data, (b) the details of functn, and (c) the non-pandas implementation you used. As is, it's hard to know where your bottleneck is. (Benchmarking data would be nice too.) Commented Dec 14, 2017 at 4:20
  • I'll edit the original post to reflect this. Thanks! Commented Dec 14, 2017 at 4:20
  • @andrew_reece it should be updated! Commented Dec 14, 2017 at 4:26
  • Looking at your filter() apply function, I noticed that you look up global_map for each row. Is that lookup time-efficient? Otherwise apply will evaluate global_map[row.A] and global_map[row.B] on every row. To speed up processing, you could pre-compute these values before calling filter() and pass them to it as arguments. Hope it helps! Commented Oct 29, 2021 at 10:00

1 Answer


IIUC, you don't need a function. Let's use boolean indexing as follows:

cond1 = df['G'] <= 0.01
cond2 = (((df.D - df.C + 1)*1.0)/global_map[df.A]) >= 0.66
cond3 = (((df.F - df.E + 1)*1.0)/global_map[df.B]) >= 0.66

mask = cond1 & (cond2 | cond3)

df[mask]
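
If global_map is a plain dict, indexing it with a whole column (global_map[df.A]) won't broadcast; one option is a sketch like the following, which wraps the mask in a reusable function and uses Series.map for the element-wise lookup (make_mask is just an illustrative name, not from the original question):

def make_mask(df, global_map):
    # Series.map does the lookup element-wise, so global_map can be a dict or a Series
    map_a = df['A'].map(global_map)
    map_b = df['B'].map(global_map)

    cond1 = df['G'] <= 0.01
    cond2 = (df['D'] - df['C'] + 1) * 1.0 / map_a >= 0.66
    cond3 = (df['F'] - df['E'] + 1) * 1.0 / map_b >= 0.66
    return cond1 & (cond2 | cond3)

filtered = df[make_mask(df, global_map)]

This keeps everything vectorized, so it should be much faster than a row-wise apply.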

9 Comments

Thanks! As a followup, is there a way to generalize this mask such that I can apply it to other dataframes? I think this is specific since it references the one df I mentioned.
If global_map is a dict, you might need to vectorize the lookup, something like np.vectorize(global_map.get)(df.A)
You should be able to write that as a function to generate mask then apply the mask to the dataframe without looping.
dicts are basically just lookup tables; they don't support passing a whole array of keys in a single lookup. Have a look at this post for more options: stackoverflow.com/q/16992713/2799941
@ScottBoston it went from taking 3.894 seconds to 0.376 seconds for the entire python program, so roughly a 10x speedup.