
What is the best way to aggregate values with a SUM ... OVER (PARTITION BY ...) window, as in the following SQL?

SQL :

select
  a.*,
  b.vol1 / sum(b.vol1) over (
    partition by a.sale, a.d_id, a.month, a.p_id
  ) vol_r,
  a.vol2 * b.vol1 / sum(b.vol1) over (
    partition by a.sale, a.d_id, a.month, a.p_id
  ) vol_t
from sales1 a
left join sales2 b
  on a.sale = b.sale
  and a.d_id = b.d_id
  and a.month = b.month
  and a.p_id = b.p_id

What would be the equivalent of this in pandas (Python)?

Input :

sales1 :

sale  d_id  month  p_id  vol2
2     580   4      9     11
2     580   4      9     11.314
2     580   4      9     20.065

sales2 :

sale  d_id  month  p_id  vol1
2     580   4      9     11
2     580   4      9     11.314
2     580   4      9     21

output :

sale  d_id  month  p_id  vol1    vol2    vol_r  vol_t
2     580   4      9     11      11      1      11
2     580   4      9     11.314  11.314  1      11.314
2     580   4      9     21      20.065  1      20.065
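For reproducibility, the sample inputs above can be built as DataFrames like this (a sketch; the column names and values are copied from the tables in the question):

```python
import pandas as pd

# sales1 from the question
df1 = pd.DataFrame({
    'sale':  [2, 2, 2],
    'd_id':  [580, 580, 580],
    'month': [4, 4, 4],
    'p_id':  [9, 9, 9],
    'vol2':  [11, 11.314, 20.065],
})

# sales2 from the question
df2 = pd.DataFrame({
    'sale':  [2, 2, 2],
    'd_id':  [580, 580, 580],
    'month': [4, 4, 4],
    'p_id':  [9, 9, 9],
    'vol1':  [11, 11.314, 21],
})
```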
Comments:

  • Can you ask the question in the proper way, with a sample dataframe and the expected output? (Aug 20, 2021)
  • @sammywemmy I have added the sample inputs and output. (Aug 25, 2021)
  • Kindly provide code that can be copied; you can try df.to_dict('records') for both dataframes. (Aug 25, 2021)
  • @sammywemmy Please take a look, thanks. (Aug 25, 2021)

1 Answer

The first part is the join, equivalent to the left join in your SQL. One thing I noticed is that the same four columns — 'sale', 'd_id', 'month', 'p_id' — appear in both the join keys and the window. In SQL you could define a named window once at the end of the query and reuse it; in pandas you can store the column list in a variable and reuse it (which gives a cleaner look). I also set these columns as the index, since the windowing operation later groups on them:

index = ['sale', 'd_id', 'month', 'p_id']

df1 = df1.set_index(index)
df2 = df2.set_index(index)

merged = df1.join(df2, how='left')
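Note that with the sample data, every key appears three times in each frame, so the join produces a 3 × 3 = 9-row cartesian product within the key group — the same row count the SQL left join returns. A self-contained sketch using the question's sample values:

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# sample inputs copied from the question's tables
df1 = pd.DataFrame({'sale': [2] * 3, 'd_id': [580] * 3, 'month': [4] * 3,
                    'p_id': [9] * 3, 'vol2': [11, 11.314, 20.065]}).set_index(index)
df2 = pd.DataFrame({'sale': [2] * 3, 'd_id': [580] * 3, 'month': [4] * 3,
                    'p_id': [9] * 3, 'vol1': [11, 11.314, 21]}).set_index(index)

merged = df1.join(df2, how='left')
print(len(merged))  # 9: each key occurs 3 times per frame -> 3 * 3 matches
```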

Next, group by the index and compute the aggregate sum of vol1. Since we need the aggregate aligned with each row (as the SQL window does), pandas' transform handles that:

grouped = merged.groupby(index)
partitioned_sum = grouped.vol1.transform('sum')

From here, we can create vol_r and vol_t via the assign method, and drop the vol1 column:

(merged.assign(vol_r = merged.vol1.div(partitioned_sum), 
               vol_t = lambda df: df.vol_r.mul(df.vol2))
       .drop(columns='vol1')
       .reset_index()
)

   sale  d_id  month  p_id    vol2     vol_r     vol_t
0     2   580      4     9  11.000  0.084653  0.931185
1     2   580      4     9  11.000  0.087070  0.957766
2     2   580      4     9  11.000  0.161611  1.777716
3     2   580      4     9  11.314  0.084653  0.957766
4     2   580      4     9  11.314  0.087070  0.985106
5     2   580      4     9  11.314  0.161611  1.828462
6     2   580      4     9  20.065  0.084653  1.698566
7     2   580      4     9  20.065  0.087070  1.747052
8     2   580      4     9  20.065  0.161611  3.242716
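Putting the steps together, here is a complete, runnable sketch using the sample data from the question (the DataFrame construction is mine; the logic is the answer's):

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# sample inputs copied from the question's tables
df1 = pd.DataFrame({
    'sale': [2, 2, 2], 'd_id': [580, 580, 580],
    'month': [4, 4, 4], 'p_id': [9, 9, 9],
    'vol2': [11, 11.314, 20.065],
}).set_index(index)
df2 = pd.DataFrame({
    'sale': [2, 2, 2], 'd_id': [580, 580, 580],
    'month': [4, 4, 4], 'p_id': [9, 9, 9],
    'vol1': [11, 11.314, 21],
}).set_index(index)

# left join on the key columns, then a windowed sum aligned to each row
merged = df1.join(df2, how='left')
partitioned_sum = merged.groupby(index).vol1.transform('sum')

out = (merged.assign(vol_r=merged.vol1.div(partitioned_sum),
                     vol_t=lambda df: df.vol_r.mul(df.vol2))
             .drop(columns='vol1')
             .reset_index())
print(out)
```

Within each partition, vol_r sums to 1, which is a quick sanity check on the windowed division.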

4 Comments

Here your output differs: I have 3 rows in my output, you have 9. I understand that the merge multiplies rows because the join keys repeat, but that doesn't happen when I execute the SQL.
I ran your SQL code before running the pandas code and it returned 9 rows.
Can you please also tell me how I would assign this to a new column in the dataframe?
You assign a new column with assign. The pandas docs are your friend.
