6

I have the dataframe below:

         id  action   
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      

I would like to create a new column "check" based on data in the previous rows of the dataframe:

  1. Find each cell in the action column equal to "DONE".
  2. Search the previous rows, before that DONE, for the first CREATED or UPDATED with the same id. If it is CREATED, put C; if it is UPDATED, put U.

Output:

         id  action   check
         ================
         10   CREATED   
         10   111
         10   222
         10   333
         10   DONE      C
         10   222
         10   UPDATED   
         777  CREATED    
         10   333
         10   DONE      U

I tried to use multiple if conditions, but it did not work for me. Can you please help?

2
  • Can there be multiple DONE values per id? Commented Jun 12, 2020 at 16:42
  • Yes, there can be multiple DONE per id, but before every DONE there should be a CREATED or UPDATED for that id. Commented Jun 12, 2020 at 16:43

3 Answers 3

1

Consider a more sophisticated sample dataframe for illustration:

# print(df)
id  action   
10   CREATED   
10   111
10   222
10   333
10   DONE      
10   222
10   UPDATED   
777  CREATED    
10   333
10   DONE
777  DONE
10   CREATED
10   DONE
11   UPDATED
11   DONE     

Use:

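# Within one segment, pick the last CREATED/UPDATED label: the running count
# of such labels first reaches its maximum at the last labelled row.
transformer = lambda s: s[(s.eq('CREATED') | s.eq('UPDATED')).cumsum().idxmax()]

# Within one id, split the rows into segments that each end at a DONE row,
# then broadcast the chosen label to every row of its segment.
grouper = (
    lambda g: g.groupby(
        g['action'].eq('DONE').cumsum().shift().fillna(0))['action']
    .transform(transformer)
)

# Keep only the first letter (C or U), then blank out every non-DONE row.
df['check'] = df.groupby('id').apply(grouper).droplevel(0).str[0]
df.loc[df['action'].ne('DONE'), 'check'] = ''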

Explanation:

First we group the dataframe on id and apply the grouper function. Inside each id group, we group again by the running count of DONE values in the action column, shifted down by one row, which splits the group into segments that each end at a DONE row. The transformer lambda then replaces every value in a segment with the last CREATED or UPDATED in that segment, that is, the first one found when looking back from the DONE row. Finally, we keep only the first letter and clear check on every row whose action is not DONE.
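
To make the segment key concrete, here is a minimal sketch that computes it for the dataframe from the question. The names is_done and segment are illustrative, and the key is built with a per-id cumsum minus the current row rather than with shift, which should give the same numbering.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    "action": ["CREATED", "111", "222", "333", "DONE",
               "222", "UPDATED", "CREATED", "333", "DONE"],
})

# 1 on DONE rows, 0 elsewhere.
is_done = df["action"].eq("DONE").astype(int)

# Number of DONE rows seen strictly before the current row, counted per id.
# Rows that share (id, segment) belong to the same stretch ending at a DONE.
segment = is_done.groupby(df["id"]).cumsum() - is_done

print(df.assign(segment=segment))
```

For id 10, rows 0 to 4 fall into segment 0 and rows 5, 6, 8 and 9 into segment 1, so each DONE row closes its own segment.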

Result:

# print(df)
     id   action check
0    10  CREATED      
1    10      111      
2    10      222      
3    10      333      
4    10     DONE     C
5    10      222      
6    10  UPDATED      
7   777  CREATED      
8    10      333      
9    10     DONE     U
10  777     DONE     C
11   10  CREATED      
12   10     DONE     C
13   11  UPDATED      
14   11     DONE     U

5 Comments

It will fail for this: justpaste.it/2vkql. If applying the logic on the consecutive Done.
I guess it will not, the value should be C as it is the first values before the DONE in group 777
And also there can't be two DONE consecutively for the same id as per the OP.
I didn't get it. Row 4 should be considered nah?
I see, you are thinking the first value from the start, but I was taking the first value from the bottom. I guess this needs to be clarified by the OP.
0

A loopy solution, not optimal but does the job.

This assumes that the rows in your dataframe are ordered in time, and that the dataframe has 2 columns ['id', 'action'] and an integer index = range(N), where N is the number of rows. Then:

df['check'] = ''
for i, action in zip(df.index, df['action']):
    if action == 'DONE':
        action_id = df.loc[i, 'id']
        prev_action = df.iloc[:i].loc[(df['id'] == action_id) & 
                      (df['action'].isin(['CREATED', 'UPDATED'])), 'action'].iloc[-1]
        if prev_action == 'CREATED':
            df.loc[i, 'check'] = 'C'
        elif prev_action == 'UPDATED':
            df.loc[i, 'check'] = 'U'

Basically we loop through actions, find cases when df['action'] == 'DONE', then get the id associated with the action and then look at the history of actions for this id previous to the current 'DONE' event by calling df.iloc[:i]. Then we narrow down this list to actions which belong to ['CREATED', 'UPDATED'], and then look at the last action in that list, based on which we assign the value to the 'check' column.
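
The same "most recent CREATED/UPDATED for this id" lookup can probably be done without an explicit loop. The following is only a sketch under the same assumption that rows are ordered in time; the names markers, last_marker and is_done are illustrative and not part of the answer above.

```python
import pandas as pd

# The dataframe from the question.
df = pd.DataFrame({
    "id": [10, 10, 10, 10, 10, 10, 10, 777, 10, 10],
    "action": ["CREATED", "111", "222", "333", "DONE",
               "222", "UPDATED", "CREATED", "333", "DONE"],
})

# Keep only the CREATED/UPDATED labels; every other row becomes NaN.
markers = df["action"].where(df["action"].isin(["CREATED", "UPDATED"]))

# Within each id, carry the most recent label forward to the later rows.
last_marker = markers.groupby(df["id"]).ffill()

# Write the first letter on DONE rows only; all other rows stay empty.
# A DONE with no earlier label for its id is left empty as well.
is_done = df["action"].eq("DONE")
df["check"] = ""
df.loc[is_done, "check"] = last_marker[is_done].str[0].fillna("")

print(df)
```

Like the loop, this looks at the whole history of an id, so a DONE that directly follows another DONE would reuse the same earlier label.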


0

I don't know whether it's the best answer but I tried to create my own logic to solve this problem.

1) Get the index of the rows where the action is DONE:

m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()

df[m]:

    id  action
4   10  DONE
9   10  DONE

idx:

[4, 9]

2) Get the index of all the rows where the action is either CREATED or UPDATED:

n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)

n_idx = df[n].index

df[n]:

    id  action
0   10  CREATED
6   10  UPDATED
7   777 CREATED

n_idx:

Int64Index([0, 6, 7], dtype='int64')

3) Fill new column "check" with empty string:

df['check'] = ''

4) You now have two sets of indexes: one for the DONE rows and one for the CREATED/UPDATED rows. For each DONE row, check whether any CREATED/UPDATED row with the same id appears in the rows before it.

ix = [0] + idx # <-- [0, 4, 9]
for a in list(zip(ix, ix[1:])): # <--- will create range (0,4), (4,9)
    for j in (n_idx):
        if j in range(a[0], a[1]): # <--- compare if CREATED/UPDATED indexes fall in this range. (checking previous row) and break if get any of them
            if (df.iloc[a[1]].id==df.iloc[j].id): # <--  check for id
                df.loc[a[1],'check'] = df.loc[j,'action'][0] # <--- assign Action
                break

Final Output:

df:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  222 
3   10  333 
4   10  DONE    C
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   10  333 
9   10  DONE    U

FULL CODE:

m = df.groupby(['id'])['action'].transform(list).eq('DONE')
idx = df[m].index.values.tolist()
n = df.groupby(['id'])['action'].transform(list).str.contains('CREATED|UPDATED', case=False)
n_idx = df[n].index
ix = [0] + idx
df['check'] = ''

for a in list(zip(ix, ix[1:])):
    for j in (n_idx):
        if (j in range(a[0], a[1]+1)) and (df.iloc[a[1]].id==df.iloc[j].id):
            df.loc[a[1],'check'] = df.loc[j,'action'][0]
            break

Sample Data with result:

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   10  DONE    
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    C
9   10  DONE    

    id  action  check
0   10  CREATED 
1   10  111 
2   10  DONE    C
3   10  333 
4   777 UPDATED 
5   10  222 
6   10  UPDATED 
7   777 CREATED 
8   777 DONE    U
9   10  DONE    
