
Let's say I have a pandas dataframe df with columns A, B, C, D, E, F, G, H, and I want to filter it using a function functn that takes a "row" and returns True or False depending on whether the row fulfills certain conditions (say the function uses every column except H). Is there a way to filter this dataframe efficiently without a long and ugly lambda? The solution I have so far looks like this:

df = df[df.apply(functn, axis=1)]

but this method is VERY slow, even for a frame with only 15k rows. Is there a clean and efficient way to filter a pandas dataframe using a user-defined Python function instead of a lambda or query?

Note: I previously implemented this using plain Python 2D lists and it was MUCH faster than pandas. Am I misusing a feature, or is there a way to make this filtering faster that I'm not aware of?

edit:

The data is structured roughly like this:

#       A       B       C     D     E     F      G        H      
[
    [string1, string2, int1, int2, int3, int4, float1, float2], 
    ...
]

The function does something like this:

def filter(row):
    var1 = row.G <= 0.01
    partial_a = (((row.D - row.C + 1)*1.0)/global_map[row.A])
    partial_b = (((row.F - row.E + 1)*1.0)/global_map[row.B])
    partial = partial_a >= 0.66 or partial_b >= 0.66
    return var1 and partial

The non-pandas implementation treated the data as a plain 2D list, looped through each row, applied the function to it (with a list as the argument instead of a "row"), and appended the row to a new list if the function returned True.
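
Roughly, it looked like this (a sketch; rows and filter_row are placeholder names, and the column indices follow the A–H layout above):

def filter_row(row):
    # row is a plain list: [A, B, C, D, E, F, G, H]
    var1 = row[6] <= 0.01                                         # G
    partial_a = (row[3] - row[2] + 1) * 1.0 / global_map[row[0]]  # D, C, A
    partial_b = (row[5] - row[4] + 1) * 1.0 / global_map[row[1]]  # F, E, B
    return var1 and (partial_a >= 0.66 or partial_b >= 0.66)

filtered = []
for row in rows:
    if filter_row(row):
        filtered.append(row)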

  • It'll be easier to help you if you can provide (a) example data, (b) the details of functn, and (c) the non-pandas implementation you used. As is, it's hard to know where your bottleneck is. (Benchmarking data would be nice too.) Commented Dec 14, 2017 at 4:20
  • I'll edit the original post to reflect this. Thanks! Commented Dec 14, 2017 at 4:20
  • @andrew_reece it should be updated! Commented Dec 14, 2017 at 4:26
  • Looking at your filter() apply function, I noticed that you look up global_map for each row. Is that lookup time-efficient? Otherwise apply will evaluate global_map[row.A] and global_map[row.B] on every row. To speed up processing, you could pre-compute these values before calling filter() and pass them to it as arguments. Hope it helps! Commented Oct 29, 2021 at 10:00

1 Answer


IIUC, you don't need a function. Let's use boolean indexing as follows:

cond1 = df['G'] <= 0.01
cond2 = (((df.D - df.C + 1)*1.0)/global_map[df.A]) >= 0.66
cond3 = (((df.F - df.E + 1)*1.0)/global_map[df.B]) >= 0.66

mask = cond1 & (cond2 | cond3)

df[mask]
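
If global_map is a plain dict, indexing it with a whole column (global_map[df.A]) won't broadcast; one option is a sketch like the following, which wraps the mask in a reusable function and uses Series.map for the element-wise lookup (make_mask is just an illustrative name, not from the original question):

def make_mask(df, global_map):
    # Series.map does the lookup element-wise, so global_map can be a dict or a Series
    map_a = df['A'].map(global_map)
    map_b = df['B'].map(global_map)

    cond1 = df['G'] <= 0.01
    cond2 = (df['D'] - df['C'] + 1) * 1.0 / map_a >= 0.66
    cond3 = (df['F'] - df['E'] + 1) * 1.0 / map_b >= 0.66
    return cond1 & (cond2 | cond3)

filtered = df[make_mask(df, global_map)]

This keeps everything vectorized, so it should be much faster than a row-wise apply.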

9 Comments

Thanks! As a followup, is there a way to generalize this mask such that I can apply it to other dataframes? I think this is specific since it references the one df I mentioned.
If global_map is a dict, you might need to vectorize the lookup, something like np.vectorize(global_map.get)(df.A)
You should be able to write that as a function to generate mask then apply the mask to the dataframe without looping.
dicts are basically just lookup tables; they don't support passing a whole array of keys in a single lookup. Have a look at this post for more options: stackoverflow.com/q/16992713/2799941
@ScottBoston it went from taking 3.894 seconds to 0.376 seconds for the entire python program, so roughly a 10x speedup.