pandas dataframe column based on previous rows

Question

I have the dataframe below:

         id  action   
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE

I would like to create a new column "check" based on data in the previous rows of the dataframe:

Find each cell in the action column that equals "DONE".
Search the previous rows, before that DONE, for the first CREATED or UPDATED with the same id. If it is CREATED, put C; if it is UPDATED, put U.

Output:

         id  action   check
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      C
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      U

I tried using multiple if conditions, but that did not work for me. Can you please help?

Yes, we could have multiple DONE per id, but before every DONE there should be a CREATED or UPDATED for that id.
– johnt, Commented Jun 12, 2020 at 16:43

Shubham Sharma · Accepted Answer · 2020-06-12 18:38:39Z

Consider a more sophisticated sample dataframe for illustration:

# print(df)
id  action   
10   CREATED   
10   111
10   222
10   333
10   DONE      
10   222
10   UPDATED   
777  CREATED    
10   333
10   DONE
777  DONE
10   CREATED
10   DONE
11   UPDATED
11   DONE

Use:

transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]

grouper = (
    lambda g: g.groupby(
        g['action'].eq('DONE').cumsum().shift().fillna(0))['action']
    .transform(transformer)
)

df['check'] = df.groupby('id').apply(grouper).droplevel(0).str[0]
df.loc[df['action'].ne('DONE'), 'check'] = ''

Explanation:

First we group the dataframe on id and apply the grouper function. Within each id group, we group again by a running count of the DONE values seen in earlier rows of the action column, which splits the group into blocks that each end at a DONE row. The transformer lambda then fills each block with the last CREATED or UPDATED value it contains, that is, the nearest one that precedes the DONE value. Finally, we keep only the first letter of that value and blank out check on every row whose action is not DONE.
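The block-splitting idea can also be written without a nested groupby. The following is only a minimal sketch, not the answer's own code: it assumes action holds strings, and the names is_done, block, marker and latest are introduced here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10, 777, 10, 10, 11, 11],
    'action': ['CREATED', '111', '222', '333', 'DONE', '222', 'UPDATED',
               'CREATED', '333', 'DONE', 'DONE', 'CREATED', 'DONE',
               'UPDATED', 'DONE'],
})

is_done = df['action'].eq('DONE')

# Per id, count the DONE rows that occur strictly before each row.
# Rows sharing a count form one block, and each block ends at a DONE row.
done_count = is_done.astype(int).groupby(df['id']).cumsum()
block = done_count - is_done.astype(int)

# Keep only CREATED/UPDATED values, then carry the latest one forward
# inside each (id, block) pair.
marker = df['action'].where(df['action'].isin(['CREATED', 'UPDATED']))
latest = marker.groupby([df['id'], block]).ffill()

# First letter on DONE rows, empty string everywhere else.
df['check'] = latest.str[0].where(is_done, '')
print(df)
```

Printing block next to df is a quick way to see where each id group is cut. A DONE row with no CREATED or UPDATED in its block would end up with a missing value in this sketch, so that case would need separate handling.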

Result:

# print(df)
     id   action check
0    10  CREATED      
1    10      111      
2    10      222      
3    10      333      
4    10     DONE     C
5    10      222      
6    10  UPDATED      
7   777  CREATED      
8    10      333      
9    10     DONE     U
10  777     DONE     C
11   10  CREATED      
12   10     DONE     C
13   11  UPDATED      
14   11     DONE     U

edited Jun 12, 2020 at 18:38

answered Jun 12, 2020 at 18:04

5 Comments

Pygirl Over a year ago

It will fail for this: justpaste.it/2vkql, if the logic is applied to consecutive DONE rows.

Shubham Sharma Over a year ago

I guess it will not; the value should be C, as it is the first value before the DONE in group 777.

Shubham Sharma Over a year ago

And also, there can't be two consecutive DONE rows for the same id, as per the OP.

Pygirl Over a year ago

I didn't get it. Row 4 should be considered nah?

Shubham Sharma Over a year ago

I see, you are thinking the first value from the start, but I was taking the first value from the bottom. I guess this needs to be clarified by the OP.
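The two readings discussed in these comments give different results only when a block contains more than one CREATED or UPDATED before its DONE. The sketch below illustrates that on made-up data (not the data from the linked paste); the names from_start and from_bottom are hypothetical labels for the two interpretations.

```python
import pandas as pd

# Made-up data: id 10 has both CREATED and UPDATED before its DONE.
df = pd.DataFrame({
    'id': [10, 10, 10, 777, 10, 777],
    'action': ['CREATED', '111', 'UPDATED', 'CREATED', 'DONE', 'DONE'],
})

is_done = df['action'].eq('DONE')
# Number of DONE rows seen earlier for the same id; each block ends at a DONE.
block = is_done.astype(int).groupby(df['id']).cumsum() - is_done.astype(int)
marker = df['action'].where(df['action'].isin(['CREATED', 'UPDATED']))
groups = marker.groupby([df['id'], block])

# 'first' and 'last' skip missing values, so they pick the earliest and the
# latest CREATED/UPDATED within each block.
from_start = groups.transform('first').str[0].where(is_done, '')
from_bottom = groups.transform('last').str[0].where(is_done, '')

print(df.assign(from_start=from_start, from_bottom=from_bottom))
```

For the DONE row of id 10, from_start yields C while from_bottom yields U; for id 777 both agree.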

Always Right Never Left · Accepted Answer · 2020-06-12 17:59:42Z

A loopy solution, not optimal but does the job.

This assumes that the rows in your dataframe are ordered in time, and that the dataframe has two columns ['id', 'action'] and an integer index equal to range(N), where N is the number of rows. Then:

df['check'] = ''
for i, action in zip(df.index, df['action']):
    if action == 'DONE':
        action_id = df.loc[i, 'id']
        prev_action = df.iloc[:i].loc[(df['id'] == action_id) & 
                      (df['action'].isin(['CREATED', 'UPDATED'])), 'action'].iloc[-1]
        if prev_action == 'CREATED':
            df.loc[i, 'check'] = 'C'
        elif prev_action == 'UPDATED':
            df.loc[i, 'check'] = 'U'

Basically we loop through the actions and find the rows where df['action'] == 'DONE'. For each one we get the id associated with the action and look at the history of actions before the current 'DONE' event by calling df.iloc[:i]. We then narrow this history down to the actions for that id which belong to ['CREATED', 'UPDATED'] and look at the last action in that list, based on which we assign the value of the 'check' column.
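The same "last CREATED/UPDATED for this id so far" lookup can be done in a single pass by remembering the most recent marker per id in a dictionary, which avoids re-slicing the dataframe for every DONE row. This is a sketch under the same assumption that rows are in time order; last_marker is a name chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    'action': ['CREATED', '111', '222', '333', 'DONE',
               '222', 'UPDATED', 'CREATED', '333', 'DONE'],
})

last_marker = {}  # id -> most recent CREATED/UPDATED seen so far
check = []

for row_id, action in zip(df['id'], df['action']):
    if action in ('CREATED', 'UPDATED'):
        last_marker[row_id] = action
        check.append('')
    elif action == 'DONE':
        # Empty string if this id has no earlier CREATED/UPDATED.
        check.append(last_marker.get(row_id, '')[:1])
    else:
        check.append('')

df['check'] = check
print(df)
```

Like the loop above, this looks at the whole history of the id rather than only the rows since the previous DONE. Unlike it, a DONE with no earlier CREATED or UPDATED gets an empty string instead of raising an error.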

Pygirl · Accepted Answer · 2020-06-12 18:42:26Z

I don't know whether it's the best answer but I tried to create my own logic to solve this problem.

1) Get the index of the rows where the action is DONE:

m = df['action'].eq('DONE')  # row-wise mask; no groupby is needed for this test
idx = df[m].index.values.tolist()

df[m]:

    id  action
4   10  DONE
9   10  DONE

idx:

[4, 9]

2) Get the index of all the rows where the action is either CREATED or UPDATED:

n = df['action'].str.contains('CREATED|UPDATED', case=False, na=False)

n_idx = df[n].index

df[n]:

    id  action
0   10  CREATED
6   10  UPDATED
7   777 CREATED

n_idx:

Int64Index([0, 6, 7], dtype='int64')

3) Fill new column "check" with empty string:

df['check'] = ''

4) You now have two indexes: one for DONE and one for CREATED/UPDATED. For each DONE, check whether the rows before it contain a CREATED/UPDATED, keeping in mind that it must have the same id.

ix = [0] + idx # <-- [0, 4, 9]
for a in list(zip(ix, ix[1:])): # <--- will create range (0,4), (4,9)
    for j in (n_idx):
        if j in range(a[0], a[1]): # <--- compare if CREATED/UPDATED indexes fall in this range. (checking previous row) and break if get any of them
            if (df.iloc[a[1]].id==df.iloc[j].id): # <--  check for id
                df.loc[a[1],'check'] = df.loc[j,'action'][0] # <--- assign Action
                break
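Pairing each DONE row with an earlier CREATED/UPDATED row of the same id can also be expressed as an as-of join with pd.merge_asof. This is a different technique from the loop above, sketched here under the assumption that the frame has a default integer index in time order; the helper column pos and the column name marker are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    'action': ['CREATED', '111', '222', '333', 'DONE',
               '222', 'UPDATED', 'CREATED', '333', 'DONE'],
})

# Row position as an explicit, sorted, numeric key for the as-of join.
df['pos'] = range(len(df))

done = df.loc[df['action'].eq('DONE'), ['pos', 'id']]
markers = (
    df.loc[df['action'].isin(['CREATED', 'UPDATED']), ['pos', 'id', 'action']]
    .rename(columns={'action': 'marker'})
)

# For each DONE row, take the closest earlier marker row with the same id.
matched = pd.merge_asof(done, markers, on='pos', by='id', direction='backward')

df['check'] = ''
df.loc[matched['pos'].to_numpy(), 'check'] = (
    matched['marker'].str[0].fillna('').to_numpy()
)
df = df.drop(columns='pos')
print(df)
```

Note that this picks the nearest earlier marker for the id, without restricting the search to the rows since the previous DONE, so it can differ from the windowed loop when two DONE rows follow each other.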

Final Output:

df:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  222 
3   10  333 
4   10  DONE    C
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   10  333 
9   10  DONE    U

FULL CODE:

m = df['action'].eq('DONE')
idx = df[m].index.values.tolist()
n = df['action'].str.contains('CREATED|UPDATED', case=False, na=False)
n_idx = df[n].index
ix = [0] + idx
df['check'] = ''

for a in list(zip(ix, ix[1:])):
    for j in (n_idx):
        if (j in range(a[0], a[1]+1)) and (df.iloc[a[1]].id==df.iloc[j].id):
            df.loc[a[1],'check'] = df.loc[j,'action'][0]
            break

Sample Data with result:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   10  DONE    
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    C
9   10  DONE

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   777 UPDATED 
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    U
9   10  DONE
