
What is the best way to aggregate values with a SUM ... OVER (PARTITION BY ...) window, as in the following SQL?

SQL :

select
  a.*,
  b.vol1 / sum(b.vol1) over (
    partition by a.sale, a.d_id, a.month, a.p_id
  ) vol_r,
  a.vol2 * b.vol1 / sum(b.vol1) over (
    partition by a.sale, a.d_id, a.month, a.p_id
  ) vol_t
from sales1 a
left join sales2 b
  on a.sale = b.sale
  and a.d_id = b.d_id
  and a.month = b.month
  and a.p_id = b.p_id

What would be the equivalent of this in pandas (Python)?

Input :

sales1 :

sale  d_id  month  p_id  vol2
2     580   4      9     11
2     580   4      9     11.314
2     580   4      9     20.065

sales2 :

sale  d_id  month  p_id  vol1
2     580   4      9     11
2     580   4      9     11.314
2     580   4      9     21

output :

sale  d_id  month  p_id  vol1    vol2    vol_r  vol_t
2     580   4      9     11      11      1      11
2     580   4      9     11.314  11.314  1      11.314
2     580   4      9     21      20.065  1      20.065
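For reproducibility, the sample inputs above can be built as DataFrames like this (a sketch; the column names and values are copied from the tables in the question):

```python
import pandas as pd

# sales1 from the question
df1 = pd.DataFrame({
    'sale':  [2, 2, 2],
    'd_id':  [580, 580, 580],
    'month': [4, 4, 4],
    'p_id':  [9, 9, 9],
    'vol2':  [11, 11.314, 20.065],
})

# sales2 from the question
df2 = pd.DataFrame({
    'sale':  [2, 2, 2],
    'd_id':  [580, 580, 580],
    'month': [4, 4, 4],
    'p_id':  [9, 9, 9],
    'vol1':  [11, 11.314, 21],
})
```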
Comments:

  • Can you ask the question in the proper way, with a sample dataframe and the expected output? (Aug 20, 2021)
  • @sammywemmy I have added the sample inputs and output. (Aug 25, 2021)
  • Kindly provide code that can be copied; you can try df.to_dict('records') for both dataframes. (Aug 25, 2021)
  • @sammywemmy Please take a look, thanks. (Aug 25, 2021)

1 Answer

The first part is the join, equivalent to the left join in your SQL. One thing I noticed is that the same four columns — 'sale', 'd_id', 'month', 'p_id' — appear in both the join keys and the window. In SQL you could define a named window once at the end of the query and reuse it; in pandas you can store the column list in a variable and reuse it (which gives a cleaner look). I also set these columns as the index, since the windowing operation later groups on them:

index = ['sale', 'd_id', 'month', 'p_id']

df1 = df1.set_index(index)
df2 = df2.set_index(index)

merged = df1.join(df2, how='left')
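Note that with the sample data, every key appears three times in each frame, so the join produces a 3 × 3 = 9-row cartesian product within the key group — the same row count the SQL left join returns. A self-contained sketch using the question's sample values:

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# sample inputs copied from the question's tables
df1 = pd.DataFrame({'sale': [2] * 3, 'd_id': [580] * 3, 'month': [4] * 3,
                    'p_id': [9] * 3, 'vol2': [11, 11.314, 20.065]}).set_index(index)
df2 = pd.DataFrame({'sale': [2] * 3, 'd_id': [580] * 3, 'month': [4] * 3,
                    'p_id': [9] * 3, 'vol1': [11, 11.314, 21]}).set_index(index)

merged = df1.join(df2, how='left')
print(len(merged))  # 9: each key occurs 3 times per frame -> 3 * 3 matches
```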

Next, group by the index and compute the aggregate sum of vol1. Since we need the aggregate aligned with each row (as the SQL window does), pandas' transform handles that:

grouped = merged.groupby(index)
partitioned_sum = grouped.vol1.transform('sum')

From here, we can create vol_r and vol_t via the assign method, and drop the vol1 column:

(merged.assign(vol_r = merged.vol1.div(partitioned_sum), 
               vol_t = lambda df: df.vol_r.mul(df.vol2))
       .drop(columns='vol1')
       .reset_index()
)

   sale  d_id  month  p_id    vol2     vol_r     vol_t
0     2   580      4     9  11.000  0.084653  0.931185
1     2   580      4     9  11.000  0.087070  0.957766
2     2   580      4     9  11.000  0.161611  1.777716
3     2   580      4     9  11.314  0.084653  0.957766
4     2   580      4     9  11.314  0.087070  0.985106
5     2   580      4     9  11.314  0.161611  1.828462
6     2   580      4     9  20.065  0.084653  1.698566
7     2   580      4     9  20.065  0.087070  1.747052
8     2   580      4     9  20.065  0.161611  3.242716
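Putting the steps together, here is a complete, runnable sketch using the sample data from the question (the DataFrame construction is mine; the logic is the answer's):

```python
import pandas as pd

index = ['sale', 'd_id', 'month', 'p_id']

# sample inputs copied from the question's tables
df1 = pd.DataFrame({
    'sale': [2, 2, 2], 'd_id': [580, 580, 580],
    'month': [4, 4, 4], 'p_id': [9, 9, 9],
    'vol2': [11, 11.314, 20.065],
}).set_index(index)
df2 = pd.DataFrame({
    'sale': [2, 2, 2], 'd_id': [580, 580, 580],
    'month': [4, 4, 4], 'p_id': [9, 9, 9],
    'vol1': [11, 11.314, 21],
}).set_index(index)

# left join on the key columns, then a windowed sum aligned to each row
merged = df1.join(df2, how='left')
partitioned_sum = merged.groupby(index).vol1.transform('sum')

out = (merged.assign(vol_r=merged.vol1.div(partitioned_sum),
                     vol_t=lambda df: df.vol_r.mul(df.vol2))
             .drop(columns='vol1')
             .reset_index())
print(out)
```

Within each partition, vol_r sums to 1, which is a quick sanity check on the windowed division.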

4 Comments

Here your output differs: I have 3 rows in my output, you have 9. I understand that the merge multiplies rows because the join keys repeat, but that doesn't happen when I execute the SQL.
I ran your SQL code before running the pandas code and it returned 9 rows.
Can you please also tell me how I would assign this to a new column in the dataframe?
You assign a new column with assign. The pandas docs are your friend.
