Pandas DataFrame filtering of groups of rows on multiple columns

Question

Here's a simplified version of my dataframe:

import pandas as pd

# Values match the printed table below
d = {'col1': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3'],
     'col2': [1, 1, 1, -1, -1, -1, -1, 1, 1, -1, 1, 1],
     'col3': [-1, -1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1]}
df = pd.DataFrame(d)
df

    col1    col2    col3
0   a1       1      -1
1   a2       1      -1
2   a3       1       1
3   b1      -1      -1
4   b2      -1      -1
5   b3      -1       1
6   c1      -1       1
7   c2       1       1
8   c3       1       1
9   d1      -1      -1
10  d2       1      -1
11  d3       1       1

I would like to pull out only those rows where col3 becomes 1 for the first time exactly n rows after col2 becomes 1 for the first time, within each letter group.

So, for example, if we're looking for rows where col3 became 1 one row after col2 became 1 (for each letter group), we'll get

    col1    col2    col3
0   d3      1       1

because for group d col2 turned from -1 to 1 at d2 and col3 turned from -1 to 1 at d3. And that hasn't happened in any other group.

If we want rows where col3 became 1 two rows after col2 became 1 (for each letter group), we'll get

    col1    col2    col3
0   a3      1       1

because for group a col2 started with 1 at a1 and col3 turned from -1 to 1 at a3.
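
One way to make the rule concrete is to compute, for each letter group, the position of the first row where each column equals 1, and the gap between the two. This is only a minimal sketch under assumptions: it uses the values from the printed table above, treats the first character of col1 as the group, and the names `pos`, `first_col2`, `first_col3`, and `gap` are illustrative, not part of the original code.

```python
import pandas as pd

# Values as shown in the printed table above
df = pd.DataFrame({
    'col1': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3'],
    'col2': [1, 1, 1, -1, -1, -1, -1, 1, 1, -1, 1, 1],
    'col3': [-1, -1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1],
})

group = df['col1'].str[0]            # assumption: the letter is the group key
pos = df.groupby(group).cumcount()   # 0-based row position inside each group

# Position of the first row where each column equals 1 (NaN if it never does)
summary = pd.DataFrame({
    'first_col2': pos.where(df['col2'] == 1).groupby(group).min(),
    'first_col3': pos.where(df['col3'] == 1).groupby(group).min(),
})
summary['gap'] = summary['first_col3'] - summary['first_col2']
print(summary)
```

In this table, group d has a gap of 1 and group a has a gap of 2, which matches the two examples above. Group b never reaches col2 == 1, and in group c col3 is already 1 before col2 is, so neither group qualifies for any n >= 0.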

Edit:

Here's my awkward way of doing it ... anyone got more elegant solutions?

n = 1  # number of rows between the two events

letter = df['col1'].str[0]
prev_col2 = df['col2'].shift(n + 1)

# The row n+1 back is missing, has col2 == -1, or belongs to another letter group
starts_here = (prev_col2.isnull() | (prev_col2 == -1) |
               (df['col1'].shift(n + 1).str[0] != letter))
# Both columns are 1 on the current row
both_on = (df['col2'] == 1) & (df['col3'] == 1)

if n > 0:
    df['newCol'] = (starts_here &
                    (df['col2'].shift(n) == 1) & (df['col3'].shift(n) == -1) &
                    (df['col2'].shift(1) == 1) & (df['col3'].shift(1) == -1) &
                    both_on &
                    (df['col1'].shift(n).str[0] == letter))
else:
    df['newCol'] = starts_here & both_on

I would recommend reading some of the Pandas documentation on how to filter on dataframes. You'll be able to answer this relatively quickly.
– mrp, Commented Oct 19, 2021 at 19:03
@mrp I know how to do basic filtering. This, however, I have found to be a challenge.
– Raksha, Commented Oct 19, 2021 at 19:04
A perhaps more performant way would be to create a new column that is a lag of the conditional column using shift(). Then you can filter using standard pandas boolean filters on arrays, which will be faster if your dataframe is very large.
– mrp, Commented Oct 19, 2021 at 19:14

mrp · Accepted Answer · 2021-10-19 19:22:57Z

1

To turn my last comment into an answer: create a new column that lags col2 by n rows using shift(), then filter the standard way and grab the first value of col1.

n = 2
df['newCol'] = df['col2'].shift(n)
df.loc[(df['col3'] == 1) & (df['newCol'] == 1), ['col1']].values[0]

You can wrap this into a function and make everything parameters.
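
As a rough sketch of such a wrapper (the function name `lagged_filter` and its parameters are hypothetical, not from the answer), the lag-and-filter idea could look like the following. It returns the matching rows as a DataFrame rather than a single value, and it leaves the caller's frame unchanged. It applies the filter exactly as written above, so the shift is not restricted to a letter group and nothing enforces the "first time" part of the question.

```python
import pandas as pd

def lagged_filter(df, n, lag_col='col2', target_col='col3', value=1):
    """Rows where target_col == value and lag_col == value n rows earlier.

    Hypothetical helper: the shift runs over the whole column, so it ignores
    letter groups and does not check for a first occurrence.
    """
    lagged = df[lag_col].shift(n)
    return df[(df[target_col] == value) & (lagged == value)]

# 16-row sample data taken from the first comment on this answer
d = {'col1': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4',
              'c1', 'c2', 'c3', 'c4', 'd1', 'd2', 'd3', 'd4'],
     'col2': [1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1],
     'col3': [-1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1]}
df = pd.DataFrame(d)

print(lagged_filter(df, n=2)['col1'].tolist())
# ['a3', 'a4', 'c4', 'd2', 'd3', 'd4']
```

On that data the filter returns six rows rather than only a3, which is the behaviour reported in the comments.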

answered Oct 19, 2021 at 19:22

mrp


2 Comments

Raksha Over a year ago

Edit: never mind, I was trying to just get the rows and forgot the values[0] part... I guess I'd need to convert the array back to a dataframe afterwards?

Close, but not quite. Try this:

d = {'col1': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'c1', 'c2', 'c3', 'c4', 'd1', 'd2', 'd3', 'd4'],
     'col2': [1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1],
     'col3': [-1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1]}

That should only be a3, but it comes back with a3, a4, c4, d2, d3, d4.

Raksha Over a year ago

lemme mess with it more, i still feel like it's not doing it exactly >.<

Scott Boston · Accepted Answer · 2021-10-20 00:43:12Z

1

Try this:

n=2
cond = pd.concat([(df['col2'] == 1).groupby(df['col1'].str[0]).cumsum().shift(n),
                  (df['col3'] == 1).groupby(df['col1'].str[0]).cumsum()], 
                 axis=1)\
         .eq(1)\
         .all(axis=1)
df[cond]

Output:

  col1  col2  col3
2   a3     1     1

Or more simply I think:

cond1 = (df['col2'] == 1).groupby(df['col1'].str[0]).cumsum().shift(n) == 1
cond2 = (df['col3'] == 1).groupby(df['col1'].str[0]).cumsum() == 1
df[cond1 & cond2]
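
Note that in both snippets `.shift(n)` runs on the whole column after the grouped cumsum, so values can carry over from the end of one letter group into the start of the next. A group-aware alternative, offered only as a sketch, compares the first position of each event within a group. The helper name `first_hit_rows` is made up here, and the sketch assumes the group is the first character of col1 and that "n rows after" means a within-group position difference of exactly n.

```python
import pandas as pd

def first_hit_rows(df, n):
    """Rows where col3 first equals 1 exactly n rows after col2 first equals 1,
    measured inside each letter group (hypothetical helper)."""
    group = df['col1'].str[0]
    pos = df.groupby(group).cumcount()  # 0-based position inside each group
    # First position per group where each column equals 1 (NaN if never)
    first2 = pos.where(df['col2'] == 1).groupby(group).transform('min')
    first3 = pos.where(df['col3'] == 1).groupby(group).transform('min')
    # Keep the row where col3 first hits 1, if the gap to col2's first 1 is n
    return df[(pos == first3) & (first3 - first2 == n)]

# 16-row sample data from the comments on the other answer
df = pd.DataFrame({
    'col1': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4',
             'c1', 'c2', 'c3', 'c4', 'd1', 'd2', 'd3', 'd4'],
    'col2': [1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1],
    'col3': [-1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1],
})

print(first_hit_rows(df, 1)['col1'].tolist())  # ['d2']
print(first_hit_rows(df, 2)['col1'].tolist())  # ['a3']
print(first_hit_rows(df, 3)['col1'].tolist())  # []
```

Comparisons against NaN evaluate to False, so groups where either column never reaches 1 drop out without special handling.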

edited Oct 20, 2021 at 0:43

answered Oct 20, 2021 at 0:38

Scott Boston


2 Comments

Raksha Over a year ago

First script almost works, but n=3 still returns d2 where it shouldn't return anything. Second script doesn't seem to work at all.

Scott Boston Over a year ago

@Raksha when I run n=3 I get empty dataframes. Anyway, I think this is an approach you can troubleshoot with your data.
