Take this df:

import pandas as pd

df = pd.DataFrame({'client_id': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'key': ['0_382', '0_382', '0_356', '1_365', float('nan'), '1_365', float('nan'), '2_284', '2_405'],
                   'operation': ['buy', 'sell', 'sell', 'buy', 'transfer', 'buy', 'fee', 'buy', 'buy']})
   client_id    key operation
0          0  0_382       buy
1          0  0_382      sell
2          0  0_356      sell
3          1  1_365       buy
4          1    NaN  transfer
5          1  1_365       buy
6          2    NaN       fee
7          2  2_284       buy
8          2  2_405       buy

I need to create a column named pos_id that assigns an incremental value (1, 2, 3, ...) to each unique combination of client_id and key within each client_id, skipping rows whose operation is 'transfer' or 'fee' (those should get NaN).

The result should be like this:

   client_id    key operation pos_id
0          0  0_382       buy      1
1          0  0_382      sell      1
2          0  0_356      sell      2
3          1  1_365       buy      1
4          1    NaN  transfer    NaN
5          1  1_365       buy      1
6          2    NaN       fee    NaN
7          2  2_284       buy      1
8          2  2_405       buy      2

1 Answer

Here are two ways.

The first method assigns the same 'pos_id' to every occurrence of a ['client_id', 'key'] pair within a 'client_id', regardless of whether the rows appear consecutively.

Use where to mask the rows you want to ignore, then groupby + ngroup with sort=False to number the unique combinations in order of appearance. Finally, subtract the minimum within each 'client_id' so the counter starts at 1.

import numpy as np

# Mask the transfer/fee rows, then number each unique (client_id, key)
# pair in order of appearance. ngroup makes NaN group keys -1.
s = (df.where(~df['operation'].isin(['transfer', 'fee']))
       .groupby(['client_id', 'key'], sort=False).ngroup()
       .replace(-1, np.nan))

# Shift each client's group numbers so the counter starts at 1.
df['pos_id'] = s - s.groupby(df['client_id']).transform('min') + 1
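For reference, these are the intermediate values this produces for the df above (a quick sanity check, not part of the solution):

>>> s   # unique (client_id, key) numbers; NaN where the row was masked
0    0.0
1    0.0
2    1.0
3    2.0
4    NaN
5    2.0
6    NaN
7    3.0
8    4.0
dtype: float64

>>> s.groupby(df['client_id']).transform('min')   # per-client offset
0    0.0
1    0.0
2    0.0
3    2.0
4    2.0
5    2.0
6    3.0
7    3.0
8    3.0
dtype: float64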

The second method requires the input to be sorted on at least 'client_id', and it only groups identical keys into the same 'pos_id' when they are consecutive (after the ignored rows are removed). Remove the rows you want to ignore, then flag each row where the key or the client changes, and take the cumsum within 'client_id'.

# Mask the transfer/fee rows and drop them entirely.
s = (df.where(~df['operation'].isin(['transfer', 'fee']))
       .dropna(how='all'))

# True where the key or the client differs from the previous remaining row.
s = s['key'].ne(s['key'].shift()) | s['client_id'].ne(s['client_id'].shift())

# Count the changes cumulatively within each client; assigning back to df
# realigns on the index, so the dropped rows come back as NaN.
df['pos_id'] = s.groupby(df['client_id']).cumsum()
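With this input, the change-flag series before the cumsum looks like this (the dropped rows 4 and 6 are simply absent, which is why they come back as NaN after the aligned assignment):

>>> s
0     True
1    False
2     True
3     True
5    False
7     True
8     True
dtype: bool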

For your input, either method results in:

   client_id    key operation  pos_id
0          0  0_382       buy     1.0
1          0  0_382      sell     1.0
2          0  0_356      sell     2.0
3          1  1_365       buy     1.0
4          1    NaN  transfer     NaN
5          1  1_365       buy     1.0
6          2    NaN       fee     NaN
7          2  2_284       buy     1.0
8          2  2_405       buy     2.0
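
If you prefer whole numbers over the 1.0/2.0 that the NaNs force, an optional final step is to cast to pandas' nullable integer dtype:

df['pos_id'] = df['pos_id'].astype('Int64')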