0
import pandas as pd

df = {'Date': ["2011-10-19", 
              "2013-01-14", 
              "2014-05-27",
              "2014-06-23",
              "2014-08-12",
              "2014-09-22",
              "2014-09-22",
               "2014-09-22"
             ], 'Status': ["Pending", 
                           "Pending", 
                           "Complete", 
                           "Pending",
                          "Complete",
                           "Pending", 
                           "Pending", 
                           "Pending"],
             'Group': ["a",
                       "a",
                       "a",
                       "a", 
                       "b",
                       "b",
                       "b",
                       "b"]}
df = pd.DataFrame(data=df)
df

I would like to create another variable that tracks the change in Status over time within each Group, so that a group is labelled "Completer" on every row after the row where it has Status = "Complete".

For example, I would like to create the "Completer" column shown in the df2 table:

df2 = {'Date': ["2011-10-19", 
              "2013-01-14", 
              "2014-05-27",
              "2014-06-23",
              "2014-08-12",
              "2014-09-22",
              "2014-09-22",
               "2014-09-22"
             ], 'Status': ["Pending", 
                           "Pending", 
                           "Complete", 
                           "Pending",
                          "Complete",
                           "Pending", 
                           "Pending", 
                           "Pending"],
             'Group': ["a",
                       "a",
                       "a",
                       "a", 
                       "b",
                       "b",
                       "b",
                       "b"],
             'Completer': ["Non-Completer",
                          "Non-Completer",
                          "Non-Completer",
                          "Completer",
                          "Non-Completer",
                          "Completer",
                          "Completer",
                          "Completer"]}
df2 = pd.DataFrame(data=df2)
df2

Thanks!

1
  • I've sorted by Date so that the rows run in increasing date order, from Oct 19 2011 to Sept 22 2014. I'd like to group by "Group" and create a completer variable that captures the change in Status, so that every row within the group after Status = "Complete" has Completer == "Completer". For example, in df2 row 4 is "Completer" because the previous row (an earlier date) has Status == "Complete". Commented Aug 22, 2022 at 17:26
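
The ordering assumption described in this comment can be made explicit before any of the per-group logic runs. The following is only a sketch on a made-up, deliberately unsorted frame (the name `raw` and its rows are hypothetical); it parses the dates, sorts them, and resets the index so that row positions follow date order:

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample using the question's column names.
raw = pd.DataFrame({
    "Date": ["2014-06-23", "2011-10-19", "2014-05-27", "2013-01-14"],
    "Status": ["Pending", "Pending", "Complete", "Pending"],
    "Group": ["a", "a", "a", "a"],
})

ordered = (
    raw.assign(Date=pd.to_datetime(raw["Date"]))  # real datetimes instead of strings
       .sort_values("Date", kind="stable")        # rows with equal dates keep their original order
       .reset_index(drop=True)                    # index 0..n-1 now follows date order
)
print(ordered)
```

Resetting the index matters for any approach that does arithmetic on index labels, since a sorted but un-reset index no longer counts up row by row.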

4 Answers

1

I was able to solve it in two steps.

First, I created a helper column that flags the row immediately after each row with Status "Complete" (the index of each "Complete" row plus 1), so that values are updated starting from the next row.

Second, since True / False behave like 1 / 0 respectively, I used a groupby on Group with cummax, which turns every following row in the group to True.

Lastly, I used replace to turn the booleans into labels and dropped the helper column.

df['first_date_per_group'] = df.index.isin(df.loc[df['Status'].eq('Complete')]['Group'].index + 1)

df = df.assign(Completer=df.groupby('Group')['first_date_per_group'].cummax()).replace(
    {True:'Completer',False:'Non-Completer'}).drop('first_date_per_group',axis=1)

prints:

         Date    Status Group      Completer
0  2011-10-19   Pending     a  Non-Completer
1  2013-01-14   Pending     a  Non-Completer
2  2014-05-27  Complete     a  Non-Completer
3  2014-06-23   Pending     a      Completer
4  2014-08-12  Complete     b  Non-Completer
5  2014-09-22   Pending     b      Completer
6  2014-09-22   Pending     b      Completer
7  2014-09-22   Pending     b      Completer

3 Comments

It doesn't look like it's picking up on the Status change. I had one group where every Status was Pending; the first row of that group was Non-Completer and all the other rows were Completer.
I'll double check my translation of the code is correct.
Oh, maybe the issue is with groups where all of the rows are Pending... I'll try to recreate it.
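
One possible explanation for the behaviour reported above, not confirmed in the thread: `index + 1` relies on a default 0..n-1 index and does not stop at group boundaries, so a flag can land on a row of a different group, or on the wrong row if the frame was sorted without resetting its index. A minimal sketch of a group-aware variant of the same helper-column idea follows; the frame is made up, with group "b" ending on a "Complete" row and group "c" entirely Pending:

```python
import pandas as pd

# Hypothetical data: "b" ends on "Complete", "c" never completes.
df = pd.DataFrame({
    "Status": ["Pending", "Complete", "Pending",
               "Pending", "Complete",
               "Pending", "Pending"],
    "Group":  ["a", "a", "a", "b", "b", "c", "c"],
})

# True where the previous row *within the same group* is "Complete".
after_complete = df.groupby("Group")["Status"].shift().eq("Complete")

# Carry the flag forward to every later row of the group.
is_completer = after_complete.groupby(df["Group"]).cummax().astype(bool)

df["Completer"] = is_completer.map({True: "Completer", False: "Non-Completer"})
print(df)
```

Because the shift happens inside each group, the "Complete" row that closes group "b" cannot spill over into group "c", and the result does not depend on the index labels.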
1

Use transform on the grouped data to assign 1 to each row that follows a "Complete" row and NaN otherwise, then ffill within each group. Rows before the first completion stay null, and that null check decides whether the column is filled with "completer" or "non-completer".

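import numpy as np

# 1 where the previous row in the same group is 'Complete', NaN elsewhere
df['completer']=df.groupby('Group')['Status'].transform(
    lambda row: np.where(row.shift(1).eq('Complete'), 1, np.nan ) )
# carry the 1 forward within each group
df['completer']=df.groupby('Group')['completer'].ffill()
# rows that are still NaN come before any completion
df['completer'] = np.where(df['completer'].isna(), 'non-completer', 'completer')
df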
         Date    Status Group      completer
0  2011-10-19   Pending     a  non-completer
1  2013-01-14   Pending     a  non-completer
2  2014-05-27  Complete     a  non-completer
3  2014-06-23   Pending     a      completer
4  2014-08-12  Complete     b  non-completer
5  2014-09-22   Pending     b      completer
6  2014-09-22   Pending     b      completer
7  2014-09-22   Pending     b      completer

0

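You can define a function that marks the "Complete" rows as 'Completer' and forward-fills (ffill) that label, then resets the "Complete" rows themselves to 'Non-Completer' and back-fills (bfill) the rows before them: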

def fill_completer(g):
    g.loc[g['Status']=='Complete', 'Completer'] = 'Completer'
    g['Completer'] = g['Completer'].ffill()
    g.loc[g['Status']=='Complete', 'Completer'] = 'Non-Completer'
    g['Completer'] = g['Completer'].bfill()
    
    return g

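Then apply it to each group:

df['Completer'] = None  # object-dtype placeholder so string labels can be assigned
df = df.groupby('Group', group_keys=False).apply(fill_completer)  # group_keys=False keeps the original index

print(df) gives: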

         Date    Status Group      Completer
0  2011-10-19   Pending     a  Non-Completer
1  2013-01-14   Pending     a  Non-Completer
2  2014-05-27  Complete     a  Non-Completer
3  2014-06-23   Pending     a      Completer
4  2014-08-12  Complete     b  Non-Completer
5  2014-09-22   Pending     b      Completer
6  2014-09-22   Pending     b      Completer
7  2014-09-22   Pending     b      Completer

0

Late, but I'd still like to add another, quite readable approach I came up with:

df['Completer'] = (df.Status.shift()  # shift Status down by one row
                     .eq("Complete")  # True where the previous row is "Complete"
                     .groupby(df.Group).cumsum()  # cumulative sum per group
                     .map({0: "Non-Completer", 1: "Completer"})  # replace 0s and 1s
                  )
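
The chain above maps only the counts 0 and 1, so it appears to assume at most one "Complete" row per group, and the plain `shift()` looks at the previous row of the whole frame rather than of the group. Here is a sketch of a variant for data where neither assumption holds. The sample frame is made up, and the per-group shift and the `clip` step are my additions, not part of the answer:

```python
import pandas as pd

# Hypothetical data: group "a" completes more than once and ends on "Complete".
df = pd.DataFrame({
    "Status": ["Complete", "Complete", "Pending", "Complete", "Pending", "Pending"],
    "Group":  ["a", "a", "a", "a", "b", "b"],
})

df["Completer"] = (
    df.groupby("Group")["Status"].shift()   # previous Status within the same group
      .eq("Complete")                       # True if that previous row was "Complete"
      .groupby(df["Group"]).cumsum()        # number of flagged rows so far in the group
      .clip(upper=1)                        # cap at 1 so repeated completions still map
      .map({0: "Non-Completer", 1: "Completer"})
)
print(df)
```

Without the cap, a count of 2 or more has no key in the mapping and would come out as NaN.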
