I have a Spark dataframe that looks like this:

import pandas as pd

dt = pd.DataFrame({'id': ['a','a','a','a','a','a','b','b','b','b'],
                   'delta': [1,2,3,4,5,6,7,8,9,10],
                   'pos': [2,0,0,1,2,1,2,0,0,1],
                   'index': [1,2,3,4,5,6,1,2,3,4]})
df = spark.createDataFrame(dt)  # assuming an active SparkSession named `spark`
I would like to sum the deltas from each row where pos==2 up to (but not including) the next row where pos==1, for every time this occurs, by id.
So I would like to add a column to the Spark dataframe that will look like this:

[6, 0, 0, 0, 5, 0, 24, 0, 0, 0]
Explanation of the result:

- 6 -> for id 'a', find the first pos==2 and sum all the deltas until (not including) the next pos==1, so 1+2+3 = 6
- 0 -> this row is not pos==2
- 0 -> this row is not pos==2
- 0 -> this row is not pos==2
- 5 -> for id 'a', find the next pos==2 and sum all the deltas until (not including) the next pos==1, so just 5
- 24 -> for id 'b', find the first pos==2 and sum all the deltas until (not including) the next pos==1, so 7+8+9 = 24
Any ideas how I can do this efficiently in pyspark?
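To make the intended logic unambiguous, here is a plain-Python reference implementation of the column I am after (the function name segment_sums is just for illustration; this is the logic I want to reproduce efficiently in pyspark):

```python
def segment_sums(ids, deltas, pos):
    """For each row where pos == 2, emit the sum of deltas from that row
    up to (not including) the next row with pos == 1 within the same id;
    emit 0 for every other row."""
    n = len(ids)
    out = [0] * n
    for i in range(n):
        if pos[i] == 2:
            total = 0
            j = i
            # accumulate deltas until we hit a pos == 1 row or leave this id
            while j < n and ids[j] == ids[i] and pos[j] != 1:
                total += deltas[j]
                j += 1
            out[i] = total
    return out

ids   = ['a','a','a','a','a','a','b','b','b','b']
delta = [1,2,3,4,5,6,7,8,9,10]
pos   = [2,0,0,1,2,1,2,0,0,1]
print(segment_sums(ids, delta, pos))  # -> [6, 0, 0, 0, 5, 0, 24, 0, 0, 0]
```

In other words: for each id, a segment opens at a pos==2 row and closes just before the following pos==1 row, and only the opening row receives the segment's sum.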
EDIT
The dataframe is ordered by index and id.