Create new variable based on change in another variable in Python

Question

import pandas as pd

df = {'Date': ["2011-10-19", "2013-01-14", "2014-05-27", "2014-06-23",
               "2014-08-12", "2014-09-22", "2014-09-22", "2014-09-22"],
      'Status': ["Pending", "Pending", "Complete", "Pending",
                 "Complete", "Pending", "Pending", "Pending"],
      'Group': ["a", "a", "a", "a", "b", "b", "b", "b"]}
df = pd.DataFrame(data=df)
df

I would like to create another variable based on the change in Status over time within each group: a group counts as a "Completer" starting from the row after the one where Status = "Complete".

For example I would like to create the "completer" column in the df2 table:

df2 = {'Date': ["2011-10-19", "2013-01-14", "2014-05-27", "2014-06-23",
                "2014-08-12", "2014-09-22", "2014-09-22", "2014-09-22"],
       'Status': ["Pending", "Pending", "Complete", "Pending",
                  "Complete", "Pending", "Pending", "Pending"],
       'Group': ["a", "a", "a", "a", "b", "b", "b", "b"],
       'Completer': ["Non-Completer", "Non-Completer", "Non-Completer", "Completer",
                     "Non-Completer", "Completer", "Completer", "Completer"]}
df2 = pd.DataFrame(data=df2)
df2

Thanks!

I've sorted by Date so that each row has the same or a later date than the one before it, from Oct 19 2011 to Sept 22 2014. I'd like to group by "Group" and create a Completer variable that captures the change in Status, such that every row within the group after Status = Complete has Completer == "Completer". For example, in df2 row 4 is "Completer" because the previous row (an earlier date) has Status == Complete.
– Kreitz Gigs, Commented Aug 22, 2022 at 17:26

sophocles · Accepted Answer · 2022-08-22 17:29:50Z

I was able to solve it in two steps.

First, I created a helper column that flags the row immediately after each "Complete" row (the index of the "Complete" row plus 1), so that the update starts from the next row. This relies on the default integer index.

Second, since True / False translate to 1 / 0 respectively, I used a groupby on Group with cummax, which sets all following rows in each group to True.

Finally, I used replace to turn the flags into labels and dropped the helper column.

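# Flag the row right after each "Complete" row (relies on the default RangeIndex).
df['first_date_per_group'] = df.index.isin(df.loc[df['Status'].eq('Complete')].index + 1)

# Propagate the flag down each group, map it to labels, and drop the helper column.
df = df.assign(Completer=df.groupby('Group')['first_date_per_group'].cummax()).replace(
    {True: 'Completer', False: 'Non-Completer'}).drop('first_date_per_group', axis=1)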

prints:

         Date    Status Group      Completer
0  2011-10-19   Pending     a  Non-Completer
1  2013-01-14   Pending     a  Non-Completer
2  2014-05-27  Complete     a  Non-Completer
3  2014-06-23   Pending     a      Completer
4  2014-08-12  Complete     b  Non-Completer
5  2014-09-22   Pending     b      Completer
6  2014-09-22   Pending     b      Completer
7  2014-09-22   Pending     b      Completer
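
Because the flag above is built from the index plus 1, a "Complete" on the last row of one group would flag the first row of the next group. A minimal sketch of a variant that shifts within each group instead is shown here. The `demo` frame, its group "c", and the names `prev_complete` and `flag` are hypothetical and only serve as illustration; this is not necessarily the cause of the behaviour reported in the comments.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" ends on "Complete", group "b" is all "Pending",
# group "c" starts with "Complete".
demo = pd.DataFrame({
    "Status": ["Pending", "Complete", "Pending", "Pending",
               "Complete", "Pending", "Pending"],
    "Group": ["a", "a", "b", "b", "c", "c", "c"],
})

# Shift Status down by one row within each group, so the first row of a
# group never sees the last row of the previous group.
prev_complete = demo.groupby("Group")["Status"].shift().eq("Complete")

# Cast to 0/1 and take the running maximum per group: once a row is flagged,
# every later row of the same group stays flagged.
flag = prev_complete.astype(int).groupby(demo["Group"]).cummax()

demo["Completer"] = np.where(flag.eq(1), "Completer", "Non-Completer")
print(demo)
```

In this sketch only the last two rows of group "c" come out as "Completer"; groups "a" and "b" stay "Non-Completer" throughout.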

edited Aug 22, 2022 at 17:29

answered Aug 22, 2022 at 17:27

3 Comments

Kreitz Gigs Over a year ago

It doesn't look like it's picking up on the Status change. I had one group where every Status was Pending; the first row of that group was Non-completer and all the other rows were Completer.

Kreitz Gigs Over a year ago

I'll double check my translation of the code is correct.

Kreitz Gigs Over a year ago

Oh maybe the issue is for groups where all of the rows are pending... I'll try to recreate

Naveed · Accepted Answer · 2022-08-22 17:32:28Z

Use transform on the grouped data to assign 1 where the previous row in the group is "Complete" and NaN elsewhere, then ffill within each group. Rows before the first "Complete" stay null, and that null mask decides whether each row is labelled completer or non-completer.

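import numpy as np

# 1 where the previous row in the same group is "Complete", NaN otherwise
df['completer'] = df.groupby('Group')['Status'].transform(
    lambda s: np.where(s.shift(1).eq('Complete'), 1, np.nan))
# carry the 1 forward to all later rows of the group
df['completer'] = df.groupby('Group')['completer'].ffill()
df['completer'] = np.where(df['completer'].isna(), 'non-completer', 'completer')
df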

         Date    Status Group      completer
0  2011-10-19   Pending     a  non-completer
1  2013-01-14   Pending     a  non-completer
2  2014-05-27  Complete     a  non-completer
3  2014-06-23   Pending     a      completer
4  2014-08-12  Complete     b  non-completer
5  2014-09-22   Pending     b      completer
6  2014-09-22   Pending     b      completer
7  2014-09-22   Pending     b      completer

SomeDude · Accepted Answer · 2022-08-22 18:33:11Z

You can define a function that marks each "Complete" row as 'Completer' and forward-fills that label, then resets the "Complete" rows themselves to 'Non-Completer' and back-fills the rows before them:

def fill_completer(g):
    g.loc[g['Status']=='Complete', 'Completer'] = 'Completer'
    g['Completer'] = g['Completer'].ffill()
    g.loc[g['Status']=='Complete', 'Completer'] = 'Non-Completer'
    g['Completer'] = g['Completer'].bfill()
    
    return g

Then apply it to each group as :

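import numpy as np

df['Completer'] = np.nan
# group_keys=False keeps the original flat index in the result
df = df.groupby('Group', group_keys=False).apply(fill_completer)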

print(df):

         Date    Status Group      Completer
0  2011-10-19   Pending     a  Non-Completer
1  2013-01-14   Pending     a  Non-Completer
2  2014-05-27  Complete     a  Non-Completer
3  2014-06-23   Pending     a      Completer
4  2014-08-12  Complete     b  Non-Completer
5  2014-09-22   Pending     b      Completer
6  2014-09-22   Pending     b      Completer
7  2014-09-22   Pending     b      Completer

fsimonjetz · Accepted Answer · 2022-08-22 21:28:51Z

Late, but I'd still like to add another, quite readable approach I came up with:

df['Completer'] = (df.Status.shift()  # shift Status down by one row
                     .eq("Complete")  # mark "Complete" rows
                     .groupby(df.Group).cumsum()  # cumulative sum per group
                     .map({0: "Non-Completer", 1: "Completer"})  # replace 0s and 1s
                  )
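
The chain above shifts Status across the whole frame and maps only the counts 0 and 1, so a group with two "Complete" rows would reach a count of 2 and come out as NaN. A minimal sketch of a variant that shifts within each group and treats any positive count as "Completer" follows; the `demo` frame and the name `counts` are hypothetical, chosen only to show the multiple-"Complete" case.

```python
import pandas as pd

# Hypothetical data: group "a" has two "Complete" rows, group "b" ends on one.
demo = pd.DataFrame({
    "Status": ["Complete", "Pending", "Complete", "Pending",
               "Pending", "Complete"],
    "Group": ["a", "a", "a", "a", "b", "b"],
})

counts = (demo.groupby("Group")["Status"].shift()  # previous Status within the same group
              .eq("Complete")                       # True where the previous row was "Complete"
              .astype(int)
              .groupby(demo["Group"]).cumsum())     # earlier "Complete" rows seen so far, per group

# Any positive count means at least one earlier "Complete" row in the group.
demo["Completer"] = counts.gt(0).map({True: "Completer", False: "Non-Completer"})
print(demo)
```

Comparing against zero rather than mapping exact counts keeps the label defined no matter how many "Complete" rows a group contains.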

answered Aug 22, 2022 at 21:28
