2

I am working in Pandas, and I want to apply multiple filters to a data frame across multiple fields.

I am working with another, more complex data frame, but I am simplifying the context for this question. Here is the setup for a sample data frame:

import numpy as np
import pandas as pd
dates = pd.date_range('20170101', periods=16)
rand_df = pd.DataFrame(np.random.randn(16,4), index=dates, columns=list('ABCD'))

Applying one filter to this data frame is well documented and simple:

rand_df.loc[lambda df: df['A'] < 0]

Because the lambda body looks like a simple boolean expression, it is tempting to join two conditions with "and", as below. This does not work: Python's "and" needs a single True or False from each operand, and the truth value of a whole Series is ambiguous, so pandas raises a ValueError:

rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-31-dfa05ab293f9> in <module>()
----> 1 rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
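
A minimal sketch of why the call above fails, using a small hand-made frame rather than the random one (the data and the names small_df and both_negative are illustrative assumptions, not from the question). Python's "and" asks each operand for a single truth value, which a Series refuses to give; "&" instead combines the two masks element by element.

```python
import pandas as pd

# Illustrative data, not from the question: only the first row has A < 0 and B < 0.
small_df = pd.DataFrame(
    {'A': [-1.0, 2.0, -3.0, 4.0], 'B': [-1.0, -2.0, 3.0, 4.0]},
    index=pd.date_range('20170101', periods=4),
)

# "and" calls bool() on the Series (df['A'] < 0), which raises ValueError.
try:
    small_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
    raised = False
except ValueError:
    raised = True

# "&" is element-wise. The parentheses matter because & binds tighter than <.
both_negative = small_df.loc[lambda df: (df['A'] < 0) & (df['B'] < 0)]
print(raised, len(both_negative))
```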

I have found two ways to successfully implement this. I will add them to the potential answers, so you can comment directly on them as solutions. However, I would like to solicit other approaches, since I am not really sure that either of these is a very standard approach for filtering a Pandas data frame.

3
  • UN-DUPLICATE: The question for which this has been labeled as a duplicate does answer my question. However it is not as clean as this one. That question has some superfluous context, such as that the data was read in from a CSV. This is a clean example, where you can paste the code straight into your own REPL, come up with an answer, and post it. In a very short period of time, this question had more answers than the duplicate nominee. Therefore, I think it makes sense to reopen. Commented Oct 10, 2017 at 13:03
  • The question is exactly the same, and the duplicate answer was written by the creator of pandas, so I think it's a safe bet that that is the best way to filter a dataframe. Commented Oct 13, 2017 at 14:40
  • Thanks. Humbly noted that I should consider special weight to Pandas questions answered by Wes McKinney. Commented Oct 13, 2017 at 14:47

5 Answers

11
In [3]: rand_df.query("A < 0 and B < 0")
Out[3]:
                   A         B         C         D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199  1.202082  0.136657
2017-01-08 -1.445050 -0.367112 -2.617743  0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507

or:

In [6]: rand_df[rand_df[['A','B']].lt(0).all(axis=1)]
Out[6]:
                   A         B         C         D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199  1.202082  0.136657
2017-01-08 -1.445050 -0.367112 -2.617743  0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
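
If the cutoff lives in a Python variable rather than in the string, query can reference it with an "@" prefix. The sketch below uses a small hand-made frame and a hypothetical variable named threshold (both are assumptions for illustration), and builds the same selection with lt(...).all(axis=1) for comparison.

```python
import pandas as pd

# Illustrative data, not from the question: only the first row has A < 0 and B < 0.
small_df = pd.DataFrame(
    {'A': [-1.0, 2.0, -3.0, 4.0], 'B': [-1.0, -2.0, 3.0, 4.0]},
    index=pd.date_range('20170101', periods=4),
)

threshold = 0  # hypothetical cutoff; the answer above hard-codes 0

# "@threshold" pulls the value from the surrounding Python scope.
by_query = small_df.query("A < @threshold and B < @threshold")

# Same rows via a column subset, an element-wise comparison, and a row-wise all().
by_mask = small_df[small_df[['A', 'B']].lt(threshold).all(axis=1)]
print(by_query.index.equals(by_mask.index))
```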

PS You will find a lot of examples in the Pandas docs

1 Comment

This one captures the essence of the question: avoiding multiple references to the enclosing dataframe
5
rand_df[(rand_df.A < 0) & (rand_df.B < 0)]
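
When the number of conditions grows, the same "&" pattern can be folded over a list. This is only a sketch under assumptions: the small frame, the extra column C, and the names conditions and mask are illustrative and do not come from the answer.

```python
from functools import reduce
import operator

import pandas as pd

# Illustrative data: only the first row is negative in A, B and C.
small_df = pd.DataFrame(
    {
        'A': [-1.0, 2.0, -3.0, 4.0],
        'B': [-1.0, -2.0, 3.0, 4.0],
        'C': [-1.0, -1.0, -1.0, 1.0],
    },
    index=pd.date_range('20170101', periods=4),
)

# One boolean Series per column to test.
conditions = [small_df[col] < 0 for col in ['A', 'B', 'C']]

# Fold the list with "&" so a row passes only if every condition holds.
mask = reduce(operator.and_, conditions)
filtered = small_df[mask]
print(len(filtered))
```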

4

To use the lambda, combine the conditions with the element-wise "&" operator instead of "and", and wrap each comparison in parentheses.

rand_df.loc[lambda x: (x.A < 0) & (x.B < 0)]
# Or
# rand_df[lambda x: (x.A < 0) & (x.B < 0)]

                   A         B         C         D
2017-01-12 -0.460918 -1.001184 -0.796981  0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120

You may be able to speed up the evaluation by building the boolean masks from the underlying NumPy arrays:

c1 = rand_df.A.values < 0
c2 = rand_df.B.values < 0
rand_df[c1 & c2]

                   A         B         C         D
2017-01-12 -0.460918 -1.001184 -0.796981  0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
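
A small check that the array-based mask selects the same rows as the Series-based one. The frame and variable names are illustrative assumptions, and to_numpy() is used here in place of .values. Nothing is timed in this sketch, so any speed difference would need to be measured on your own data.

```python
import numpy as np
import pandas as pd

# Illustrative data, not from the answer: only the first row has A < 0 and B < 0.
small_df = pd.DataFrame(
    {'A': [-1.0, 2.0, -3.0, 4.0], 'B': [-1.0, -2.0, 3.0, 4.0]},
    index=pd.date_range('20170101', periods=4),
)

# Comparing the raw arrays yields plain boolean ndarrays with no index attached.
c1 = small_df['A'].to_numpy() < 0
c2 = small_df['B'].to_numpy() < 0
from_arrays = small_df[c1 & c2]

# The Series-based mask carries the index along; the selection is the same.
from_series = small_df[(small_df['A'] < 0) & (small_df['B'] < 0)]
print(from_arrays.index.equals(from_series.index))
```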

3

Here is an approach that “chains” use of the ‘loc’ operation:

rand_df.loc[lambda df: df['A'] < 0].loc[lambda df: df['B'] < 0]
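
To make the chaining concrete: each lambda receives the frame produced by the previous step, so the second filter only sees rows that already passed the first. A sketch on a small hand-made frame (the data and the names step_one, chained and combined are illustrative assumptions):

```python
import pandas as pd

# Illustrative data: rows 1 and 3 have A < 0; only row 1 also has B < 0.
small_df = pd.DataFrame(
    {'A': [-1.0, 2.0, -3.0, 4.0], 'B': [-1.0, -2.0, 3.0, 4.0]},
    index=pd.date_range('20170101', periods=4),
)

# First link: keep rows where A is negative (an intermediate frame is created).
step_one = small_df.loc[lambda df: df['A'] < 0]

# Second link: the lambda now receives step_one, not small_df.
chained = step_one.loc[lambda df: df['B'] < 0]

# A single combined mask gives the same rows without the intermediate frame.
combined = small_df.loc[lambda df: (df['A'] < 0) & (df['B'] < 0)]
print(len(step_one), len(chained), len(combined))
```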

1

Here is an approach that writes a function to do the filtering. I am sure some filters will be complex enough that a function is the best way to go (this case is not). Also, when I am using Pandas and I write a “for” loop, I feel like I am doing it wrong.

def lt_zero_ab(df):
    # Collect the index labels of rows where both A and B are negative.
    result = []
    for index, row in df.iterrows():
        if row['A'] < 0 and row['B'] < 0:
            result.append(index)
    return result

rand_df.loc[lt_zero_ab]
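
For comparison, here is a sketch of the same function written without the loop. It returns a boolean mask instead of a list of labels, and loc accepts either from a callable. The small frame and the names lt_zero_ab_loop and lt_zero_ab_mask are illustrative assumptions.

```python
import pandas as pd

# Illustrative data: only the first row has A < 0 and B < 0.
small_df = pd.DataFrame(
    {'A': [-1.0, 2.0, -3.0, 4.0], 'B': [-1.0, -2.0, 3.0, 4.0]},
    index=pd.date_range('20170101', periods=4),
)

def lt_zero_ab_loop(df):
    # Row-by-row version: returns the index labels of matching rows.
    return [index for index, row in df.iterrows() if row['A'] < 0 and row['B'] < 0]

def lt_zero_ab_mask(df):
    # Vectorized version: returns a boolean Series aligned to df's index.
    return (df['A'] < 0) & (df['B'] < 0)

by_loop = small_df.loc[lt_zero_ab_loop]
by_mask = small_df.loc[lt_zero_ab_mask]
print(list(by_loop.index) == list(by_mask.index))
```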
