Take this df:

import pandas as pd

df = pd.DataFrame({'client_id': [0, 0, 0, 1, 1, 1, 2, 2, 2],
                   'key': ['0_382', '0_382', '0_356', '1_365', float('nan'), '1_365', float('nan'), '2_284', '2_405'],
                   'operation': ['buy', 'sell', 'sell', 'buy', 'transfer', 'buy', 'fee', 'buy', 'buy']})
   client_id    key operation
0          0  0_382       buy
1          0  0_382      sell
2          0  0_356      sell
3          1  1_365       buy
4          1    NaN  transfer
5          1  1_365       buy
6          2    NaN       fee
7          2  2_284       buy
8          2  2_405       buy

I need to create a column named pos_id that assigns an incremental value (1, 2, 3, ...) to each unique combination of client_id and key within each client_id, skipping rows whose operation is 'transfer' or 'fee' (those should get NaN).

The result should be like this:

   client_id    key operation pos_id
0          0  0_382       buy      1
1          0  0_382      sell      1
2          0  0_356      sell      2
3          1  1_365       buy      1
4          1    NaN  transfer    NaN
5          1  1_365       buy      1
6          2    NaN       fee    NaN
7          2  2_284       buy      1
8          2  2_405       buy      2

1 Answer

Here are two ways.

The first method assigns the same 'pos_id' to every occurrence of a ['client_id', 'key'] pair within a 'client_id', regardless of whether the rows appear consecutively.

Use where to mask the rows you want to ignore, then groupby + ngroup with sort=False to number the unique combinations in order of appearance. Finally, subtract the minimum within each 'client_id' so the counter starts at 1.

import numpy as np

# Mask the transfer/fee rows, then number each unique (client_id, key)
# pair in order of appearance. ngroup makes NaN group keys -1.
s = (df.where(~df['operation'].isin(['transfer', 'fee']))
       .groupby(['client_id', 'key'], sort=False).ngroup()
       .replace(-1, np.nan))

# Shift each client's group numbers so the counter starts at 1.
df['pos_id'] = s - s.groupby(df['client_id']).transform('min') + 1
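For reference, these are the intermediate values this produces for the df above (a quick sanity check, not part of the solution):

>>> s   # unique (client_id, key) numbers; NaN where the row was masked
0    0.0
1    0.0
2    1.0
3    2.0
4    NaN
5    2.0
6    NaN
7    3.0
8    4.0
dtype: float64

>>> s.groupby(df['client_id']).transform('min')   # per-client offset
0    0.0
1    0.0
2    0.0
3    2.0
4    2.0
5    2.0
6    3.0
7    3.0
8    3.0
dtype: float64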

The second method requires the input to be sorted on at least 'client_id', and it only groups identical keys into the same 'pos_id' when they are consecutive (after the ignored rows are removed). Remove the rows you want to ignore, then flag each row where the key or the client changes, and take the cumsum within 'client_id'.

# Mask the transfer/fee rows and drop them entirely.
s = (df.where(~df['operation'].isin(['transfer', 'fee']))
       .dropna(how='all'))

# True where the key or the client differs from the previous remaining row.
s = s['key'].ne(s['key'].shift()) | s['client_id'].ne(s['client_id'].shift())

# Count the changes cumulatively within each client; assigning back to df
# realigns on the index, so the dropped rows come back as NaN.
df['pos_id'] = s.groupby(df['client_id']).cumsum()
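With this input, the change-flag series before the cumsum looks like this (the dropped rows 4 and 6 are simply absent, which is why they come back as NaN after the aligned assignment):

>>> s
0     True
1    False
2     True
3     True
5    False
7     True
8     True
dtype: bool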

For your input, either method results in:

   client_id    key operation  pos_id
0          0  0_382       buy     1.0
1          0  0_382      sell     1.0
2          0  0_356      sell     2.0
3          1  1_365       buy     1.0
4          1    NaN  transfer     NaN
5          1  1_365       buy     1.0
6          2    NaN       fee     NaN
7          2  2_284       buy     1.0
8          2  2_405       buy     2.0
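
If you prefer whole numbers over the 1.0/2.0 that the NaNs force, an optional final step is to cast to pandas' nullable integer dtype:

df['pos_id'] = df['pos_id'].astype('Int64')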