
Here's a simplified version of my dataframe:

import pandas as pd

d = {'col1': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3'], 'col2': [1, 1, 1, -1, -1, -1, -1, 1, 1, -1, 1, 1], 'col3': [-1, -1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1]}
df = pd.DataFrame(d)
df
    col1    col2    col3
0   a1       1      -1
1   a2       1      -1
2   a3       1       1
3   b1      -1      -1
4   b2      -1      -1
5   b3      -1       1
6   c1      -1       1
7   c2       1       1
8   c3       1       1
9   d1      -1      -1
10  d2       1      -1
11  d3       1       1

I would like to pull out only those rows where, within each letter group, col3 == 1 for the first time exactly n rows after col2 == 1 for the first time.

So for example, if we're looking for when col3 became 1 one row after col2 became 1 (for each letter group), we'll get:

    col1    col2    col3
0   d3      1       1

because for group d col2 turned from -1 to 1 at d2 and col3 turned from -1 to 1 at d3. And that hasn't happened in any other group.

If we want rows where col3 became 1 two rows after col2 became 1 (for each letter group), we'll get:

    col1    col2    col3
0   a3      1       1

because for group a col2 started with 1 at a1 and col3 turned from -1 to 1 at a3.

Edit:

Here's my awkward way of doing it ... anyone got more elegant solutions?

df['newCol'] = (
           (((df['col2'].shift(n+1).isnull() | (df['col2'].shift(n+1) == -1)) &
           (df['col2'].shift(n+1).isnull() | (df['col2'].shift(n+1) == -1))) |
           (df['col1'].shift(n+1).str[0] != df['col1'].str[0])) &
           (df['col2'].shift(n) == 1) &
           (df['col3'].shift(n) == -1) &
           (df['col2'].shift(1) == 1) &
           (df['col3'].shift(1) == -1) &
           (df['col2'] == 1) &
           (df['col3'] == 1) &
           (df['col1'].shift(n).str[0] == df['col1'].str[0])) if n > 0 \
            else \
           ((((df['col2'].shift(n+1).isnull() | (df['col2'].shift(n+1) == -1)) &
           (df['col2'].shift(n+1).isnull() | (df['col2'].shift(n+1) == -1))) |
           (df['col1'].shift(n+1).str[0] != df['col1'].str[0])) &
           (df['col2'] == 1) &
           (df['col3'] == 1))
    
  • col1 is your label, do you mean after col2 became 1? Commented Oct 19, 2021 at 18:50
  • @QuangHoang sorry, yes, just fixed it Commented Oct 19, 2021 at 18:50
  • I would recommend reading some of the Pandas documentation on how to filter on dataframes. You'll be able to answer this relatively quickly. Commented Oct 19, 2021 at 19:03
  • @mrp I know how to do basic filtering. This, however, I have found to be a challenge. Commented Oct 19, 2021 at 19:04
  • A perhaps more performant way would be to create a new column that is a lag of the conditional column using shift(). Then you can filter with standard pandas boolean masks, which will be faster if your dataframe is very large. Commented Oct 19, 2021 at 19:14

2 Answers

To turn my last comment into an answer: create a new column that lags col2 by n rows, then filter the standard way and grab the first value of col1.

n = 2
df['newCol'] = df['col2'].shift(n)
df.loc[(df['col3'] == 1) & (df['newCol'] == 1), ['col1']].values[0]

You can wrap this into a function and make everything parameters.
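A rough sketch of that wrapping, in case it helps. The function name `lagged_matches` and its parameters are made up for illustration, and it returns the matching rows as a DataFrame instead of adding a column or calling `.values[0]`. It keeps every row where col3 is 1 and col2 was 1 n rows earlier, so by itself it does not enforce "for the first time" or respect the letter groups.

```python
import pandas as pd

def lagged_matches(df, n, lag_col='col2', hit_col='col3'):
    """Rows where hit_col is 1 and lag_col was 1 exactly n rows earlier."""
    lagged = df[lag_col].shift(n)  # value of lag_col n rows above each row
    return df[(df[hit_col] == 1) & (lagged == 1)]

# Sample frame matching the table printed in the question
df = pd.DataFrame({
    'col1': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3'],
    'col2': [1, 1, 1, -1, -1, -1, -1, 1, 1, -1, 1, 1],
    'col3': [-1, -1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1],
})

rows = lagged_matches(df, n=2)
# Guard against an empty result before taking the first label
first_label = rows['col1'].iloc[0] if not rows.empty else None
```

On the table from the question, n=2 gives only a3, but n=1 gives a3, c3 and d3, because the lag crosses group boundaries and nothing checks for first occurrences.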


2 Comments

Close, but not quite. Try this: d = {'col1': ['a1', 'a2', 'a3', 'a4', 'b1', 'b2', 'b3', 'b4', 'c1', 'c2', 'c3', 'c4', 'd1', 'd2', 'd3', 'd4'],'col2': [1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1],'col3': [-1, -1, 1, 1, -1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1]}. That should only be a3, but it comes back with a3, a4, c4, d2, d3, d4. Edit: never mind, I was trying to just get the rows and forgot the values[0] part... I guess I'd need to convert the array back to a dataframe afterwards?
Let me mess with it more, I still feel like it's not doing it exactly >.<

Try this:

n=2
cond = pd.concat([(df['col2'] == 1).groupby(df['col1'].str[0]).cumsum().shift(n),
                  (df['col3'] == 1).groupby(df['col1'].str[0]).cumsum()], 
                 axis=1)\
         .eq(1)\
         .all(axis=1)
df[cond]

Output:

  col1  col2  col3
2   a3     1     1

Or more simply I think:

cond1 = (df['col2'] == 1).groupby(df['col1'].str[0]).cumsum().shift(n) == 1
cond2 = (df['col3'] == 1).groupby(df['col1'].str[0]).cumsum() == 1
df[cond1 & cond2]
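If the plain shift(n) is a worry because it slides values from one letter group into the next, here is a sketch of a variant that works with row positions inside each group instead. It assumes the group is the first character of col1, and it reads the requirement as "the first col3 == 1 row of a group sits exactly n rows after the first col2 == 1 row of that group". The name `first_hit_rows` is hypothetical.

```python
import pandas as pd

def first_hit_rows(df, n):
    """Rows where col3 first hits 1 exactly n rows after col2 first hits 1, per group."""
    group = df['col1'].str[0]            # assumed grouping key: leading letter
    pos = df.groupby(group).cumcount()   # 0-based position of each row in its group
    # Position of the first col2 == 1 / col3 == 1 row in each group (NaN if none)
    first2 = pos.where(df['col2'] == 1).groupby(group).transform('min')
    first3 = pos.where(df['col3'] == 1).groupby(group).transform('min')
    # Keep the first col3 == 1 row, only when it is n rows after the first col2 == 1 row
    return df[(pos == first3) & (first3 - first2 == n)]

# Sample frame matching the table printed in the question
df = pd.DataFrame({
    'col1': ['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3', 'd1', 'd2', 'd3'],
    'col2': [1, 1, 1, -1, -1, -1, -1, 1, 1, -1, 1, 1],
    'col3': [-1, -1, 1, -1, -1, 1, 1, 1, 1, -1, -1, 1],
})

one_row_later = first_hit_rows(df, 1)
two_rows_later = first_hit_rows(df, 2)
```

Groups where either column never reaches 1 produce NaN positions, so the comparison is False and they drop out without special handling.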

2 Comments

First script almost works, but n=3 still returns d2 where it shouldn't return anything. Second script doesn't seem to work at all.
@Raksa when I run n=3 I am getting empty dataframes. Anyway, I think this is an approach you can troubleshoot with your data.
