Python pandas adding sequence along two condition variables

Question

In R, one can easily add a sequence number within each combination of two (or more) condition variables using ave(), like this:

# create a dataframe
dat = data.frame(
    FactorA = c(rep('a1', 10), rep('a2', 10)),
    FactorB = c(rep('b1', 5), rep('b2', 5), rep('b1', 5), rep('b2', 5)),
    DependentVar = rnorm(20)
)

# add ordering given combination of two factors
dat$Order <- ave(dat$DependentVar, dat$FactorA, dat$FactorB,
    FUN=seq_along)

What would be an analogue in Python with pandas?

Addition on 22/06/2020:

Also, suppose the levels of FactorA and FactorB are interleaved by "shuffling" them, for example:

# a slightly "shuffled" dataframe
dat2 = data.frame(
    FactorA = c(rep('a1', 6), rep('a2', 6),
                rep('a1', 4), rep('a2', 4)),
    FactorB = c(rep('b1', 3), rep('b2', 3), rep('b1', 3), rep('b2', 3),
                rep('b1', 2), rep('b2', 2), rep('b1', 2), rep('b2', 2)),
    DependentVar = rnorm(20)
)

ave() would continue the sequence across the interruptions:

dat2$Order <- ave(dat2$DependentVar, dat2$FactorA, dat2$FactorB,
    FUN=seq_along)
dat2

   FactorA FactorB DependentVar Order
1       a1      b1    1.3814360     1
2       a1      b1    1.0702582     2
3       a1      b1   -1.1974390     3
4       a1      b2   -1.1687711     1
5       a1      b2   -0.7584645     2
6       a1      b2   -0.5541912     3
7       a2      b1   -0.3083331     1
8       a2      b1    0.7707984     2
9       a2      b1    2.4709730     3
10      a2      b2    0.1768273     1
11      a2      b2    0.5687605     2
12      a2      b2    0.7360105     3
13      a1      b1    0.9253223     4
14      a1      b1   -0.3190011     5
15      a1      b2   -0.2657454     4
16      a1      b2   -0.1617810     5
17      a2      b1    0.9634501     4
18      a2      b1   -0.6749173     5
19      a2      b2    0.8138765     4
20      a2      b2   -1.1075720     5

Can Python (1) mark which "appearance" of the combination each row belongs to and (2) restart the sequence at each new appearance, like this:

   FactorA FactorB DependentVar Order OrderReset WhichAppearance
1       a1      b1    1.3814360     1          1               1
2       a1      b1    1.0702582     2          2               1
3       a1      b1   -1.1974390     3          3               1
4       a1      b2   -1.1687711     1          1               1
5       a1      b2   -0.7584645     2          2               1
6       a1      b2   -0.5541912     3          3               1
7       a2      b1   -0.3083331     1          1               1
8       a2      b1    0.7707984     2          2               1
9       a2      b1    2.4709730     3          3               1
10      a2      b2    0.1768273     1          1               1
11      a2      b2    0.5687605     2          2               1
12      a2      b2    0.7360105     3          3               1
13      a1      b1    0.9253223     4          1               2
14      a1      b1   -0.3190011     5          2               2
15      a1      b2   -0.2657454     4          1               2
16      a1      b2   -0.1617810     5          2               2
17      a2      b1    0.9634501     4          1               2
18      a2      b1   -0.6749173     5          2               2
19      a2      b2    0.8138765     4          1               2
20      a2      b2   -1.1075720     5          2               2

Scott Boston · Accepted Answer · 2020-06-22 14:44:13Z

In Python with pandas, you can do this:

df_data['Order'] = df_data.groupby(['FactorA', 'FactorB']).cumcount() + 1

MCVE:

import pandas as pd
from io import StringIO
dat_text = StringIO("""   FactorA  FactorB  DependentVar
1       a1      b1   -1.1435908
2       a1      b1   -0.5799404
3       a1      b1    0.0680380
4       a1      b1    0.1143230
5       a1      b1    0.7673287
6       a1      b2    1.4769585
7       a1      b2   -1.3399984
8       a1      b2   -0.4832071
9       a1      b2   -2.3764355
10      a1      b2    0.2668480
11      a2      b1   -0.7376859
12      a2      b1   -0.4141878
13      a2      b1   -0.5159797
14      a2      b1   -1.3888258
15      a2      b1    0.1497270
16      a2      b2    0.1803052
17      a2      b2    0.8547880
18      a2      b2    0.2372080
19      a2      b2    0.3139455
20      a2      b2    0.7266356""")

# Raw string avoids an invalid-escape warning; columns are separated by two or more spaces
df_data = pd.read_csv(dat_text, sep=r'\s\s+', engine='python')

print(df_data)

Output:

   FactorA FactorB  DependentVar
1       a1      b1     -1.143591
2       a1      b1     -0.579940
3       a1      b1      0.068038
4       a1      b1      0.114323
5       a1      b1      0.767329
6       a1      b2      1.476958
7       a1      b2     -1.339998
8       a1      b2     -0.483207
9       a1      b2     -2.376435
10      a1      b2      0.266848
11      a2      b1     -0.737686
12      a2      b1     -0.414188
13      a2      b1     -0.515980
14      a2      b1     -1.388826
15      a2      b1      0.149727
16      a2      b2      0.180305
17      a2      b2      0.854788
18      a2      b2      0.237208
19      a2      b2      0.313945
20      a2      b2      0.726636

Use groupby with cumcount:

df_data['Order'] = df_data.groupby(['FactorA', 'FactorB']).cumcount() + 1

print(df_data)

Output:

   FactorA FactorB  DependentVar  Order
1       a1      b1     -1.143591      1
2       a1      b1     -0.579940      2
3       a1      b1      0.068038      3
4       a1      b1      0.114323      4
5       a1      b1      0.767329      5
6       a1      b2      1.476958      1
7       a1      b2     -1.339998      2
8       a1      b2     -0.483207      3
9       a1      b2     -2.376435      4
10      a1      b2      0.266848      5
11      a2      b1     -0.737686      1
12      a2      b1     -0.414188      2
13      a2      b1     -0.515980      3
14      a2      b1     -1.388826      4
15      a2      b1      0.149727      5
16      a2      b2      0.180305      1
17      a2      b2      0.854788      2
18      a2      b2      0.237208      3
19      a2      b2      0.313945      4
20      a2      b2      0.726636      5
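
For anyone who wants to try this without pasting the text table, here is a minimal sketch that builds the same factor layout as the R `dat` directly in pandas. The name `df_demo` and the `DependentVar` values (0 to 19) are placeholders I made up for illustration; only the factor columns mirror the question.

```python
import pandas as pd

# Same factor layout as the R example; DependentVar is a placeholder, not real data
df_demo = pd.DataFrame({
    'FactorA': ['a1'] * 10 + ['a2'] * 10,
    'FactorB': (['b1'] * 5 + ['b2'] * 5) * 2,
    'DependentVar': range(20),
})

# cumcount() is 0-based within each (FactorA, FactorB) group,
# so add 1 to match R's seq_along()
df_demo['Order'] = df_demo.groupby(['FactorA', 'FactorB']).cumcount() + 1

print(df_demo)
```

Each of the four combinations should be numbered 1 through 5, the same as the output above.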

Update to answer "Addition on 22/06/2020":

# Here df holds the "shuffled" dat2 data from the question, with Order already added as above

# Create a helper column that labels each consecutive run of the same (FactorA, FactorB) pair
df['newgroup'] = (df[['FactorA', 'FactorB']] != df[['FactorA', 'FactorB']].shift()).any(axis=1).cumsum()

# Use cumcount to number the rows within each run
df['Order Reset'] = df.groupby('newgroup').cumcount() + 1

# Use factorize to number the runs of each combination in order of appearance
df['Appearance'] = df.groupby(['FactorA', 'FactorB'])['newgroup'].transform(lambda x: x.factorize()[0]+1)

print(df)

Output:

   FactorA FactorB  DependentVar  Order  newgroup       Order Reset  Appearance
1       a1      b1      1.381436      1         1                 1           1
2       a1      b1      1.070258      2         1                 2           1
3       a1      b1     -1.197439      3         1                 3           1
4       a1      b2     -1.168771      1         2                 1           1
5       a1      b2     -0.758465      2         2                 2           1
6       a1      b2     -0.554191      3         2                 3           1
7       a2      b1     -0.308333      1         3                 1           1
8       a2      b1      0.770798      2         3                 2           1
9       a2      b1      2.470973      3         3                 3           1
10      a2      b2      0.176827      1         4                 1           1
11      a2      b2      0.568761      2         4                 2           1
12      a2      b2      0.736010      3         4                 3           1
13      a1      b1      0.925322      4         5                 1           2
14      a1      b1     -0.319001      5         5                 2           2
15      a1      b2     -0.265745      4         6                 1           2
16      a1      b2     -0.161781      5         6                 2           2
17      a2      b1      0.963450      4         7                 1           2
18      a2      b1     -0.674917      5         7                 2           2
19      a2      b2      0.813877      4         8                 1           2
20      a2      b2     -1.107572      5         8                 2           2
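
A self-contained sketch of the same idea, in case it helps to run it end to end. It rebuilds the factor layout of `dat2` in pandas with placeholder `DependentVar` values (0 to 19, not the numbers shown above). For the appearance column it uses a different route from the `factorize` version: flag the first row of each run and take a running sum of those flags per combination. The names `keys` and `run_start` are mine, and this variant is only one possible way to get the same result.

```python
import pandas as pd

# Factor layout of the "shuffled" dat2 from the question; DependentVar is a placeholder
df = pd.DataFrame({
    'FactorA': ['a1'] * 6 + ['a2'] * 6 + ['a1'] * 4 + ['a2'] * 4,
    'FactorB': (['b1'] * 3 + ['b2'] * 3) * 2 + (['b1'] * 2 + ['b2'] * 2) * 2,
    'DependentVar': range(20),
})

keys = ['FactorA', 'FactorB']

# Running count per combination, ignoring interruptions (like ave + seq_along in R)
df['Order'] = df.groupby(keys).cumcount() + 1

# A new run starts wherever either factor differs from the previous row;
# the cumulative sum of those change flags labels the runs 1, 2, 3, ...
df['newgroup'] = (df[keys] != df[keys].shift()).any(axis=1).cumsum()

# Restart the count at the beginning of every run
df['Order Reset'] = df.groupby('newgroup').cumcount() + 1

# Alternative to factorize: mark the first row of each run with 1,
# then accumulate those marks within each (FactorA, FactorB) combination
run_start = df['newgroup'].ne(df['newgroup'].shift()).astype(int)
df['Appearance'] = df.assign(run_start=run_start).groupby(keys)['run_start'].cumsum()

print(df)
```

The running-sum variant avoids the Python-level lambda in `transform`, which may matter on large frames, but I have not benchmarked it.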

edited Jun 22, 2020 at 14:44

answered Jun 20, 2020 at 19:28

7 Comments

striatum Over a year ago

I'm curious: how would you restart cumcount() when the rows of a two-factor combination are not consecutive? Currently, the code simply continues counting. Thanks so much!

Scott Boston Over a year ago

@striatum I don't understand your two-factor levels question. Can you create a sample dataset that includes this aspect of your curiosity?

striatum Over a year ago

Unfortunately, replying here doesn't offer full formatting capabilities. However, imagine exactly the same example as yours, but this time rows 4 and 5 come after row 10. df_data.groupby().cumcount() would just continue to count, as it should. But how would you reset the count so that the new rows 11 and 12 (formerly 4 and 5) get ['Order'] values 1 and 2 instead of 4 and 5? I hope this is clear.

Scott Boston Over a year ago

@striatum Modify the question; it would be easier.

striatum Over a year ago

@Scott Boston, I did it. Thanks!