How to best perform recursion on a pandas dataframe column

Question

I am trying to calculate an index value over a time series in a pandas dataframe. Each row after the first depends on the previous row's result. I attempted to do this recursively while iterating over the dataframe's rows, but only the first two rows of each series come out correct; the third and subsequent rows are inaccurate.

I think this is because, after the initial value, one index calculation goes wrong and every later calculation then inherits the error.

What is causing this inaccuracy? Is there a better approach than the one I've taken?

A sample of the output looks like this:


ticket_cat  Sector  Year      factor       Incorrect_index_value  correct_index_value  prev_row
Revenue     LSE     Jan 2004               100.00                 100.00
Revenue     LSE     Jan 2005  4.323542894  104.3235               104.3235             100.00
Revenue     LSE     Jan 2006  3.096308080  98.823                 107.5537                       <-- incorrect row
Revenue     LSE     Jan 2007  6.211666     107.476                114.2345                       <-- incorrect row
Revenue     LD      Jan 2004               100.00                 100.0000
Revenue     LD      Jan 2005  3.5218       103.5218               103.5218
Revenue     LD      Jan 2006  2.7417       99.2464                106.3602                       <-- incorrect row
Revenue     LD      Jan 2007  3.3506       104.1353               109.9239                       <-- incorrect row

The code snippet I have is as follows (stpassrev is the dataframe):

import numpy as np
import pandas as pd

# insert the initial value for the index
stpassrev['index_value'] = np.where(
    (stpassrev['Year'] == 'Jan 2004') & (stpassrev['Ticket_cat'] == 'Revenue'),
    100.00,
    np.nan,
)

# set up the initial values for the prev_row column
stpassrev['prev_row'] = np.where(
    # only the relevant rows are affected
    (stpassrev['Year'] == 'Jan 2005') & (stpassrev['Ticket_cat'] == 'Revenue'),
    100.00,
    np.nan,
)

# calculate the index_value row by row, using the previous row's index_value
for i in range(1, len(stpassrev)):
    stpassrev.loc[i, 'index_value'] = np.where(
        (stpassrev.loc[i, 'Ticket_cat'] == 'Revenue') & pd.notna(stpassrev.loc[i, 'factor']),
        ((100 + stpassrev.loc[i, 'factor']) / stpassrev.loc[i-1, 'index_value']) * 100,
        stpassrev.loc[i, 'index_value'],
    )
    stpassrev.loc[i, 'prev_row'] = stpassrev.loc[i-1, 'index_value']

What does your original input contain? ticket_cat, Year and factor? And you're trying to derive the index_value based on the factor and the previous index_value?
– r.ook, Commented Nov 18, 2019 at 14:43
@Aryerez I hope to have a steadily increasing value, as the index always increases.
– Greg Williams, Commented Nov 18, 2019 at 14:52
@QuangHoang yes, factor is only NULL at the beginning. The first row is the initial index value, which is 100.
– Greg Williams, Commented Nov 18, 2019 at 14:53

r.ook · Accepted Answer · 2019-11-18 16:41:23Z

Based on your updated question, you just need to do this:

# assign a new temp_factor with initial values and prep for cumprod
stpassrev['temp_factor'] = np.where(stpassrev['factor'].isna(), 1, stpassrev['factor'].add(100).div(100))

# calculate the cumprod based on the temp_factor (grouped by Sector) and multiply by 100 for index_value
stpassrev['index_value'] = stpassrev.groupby('Sector')['temp_factor'].cumprod().mul(100)

Results:

  ticket_cat Sector      Year    factor  temp_factor  index_value
0    Revenue    LSE  Jan 2004       NaN     1.000000   100.000000
1    Revenue    LSE  Jan 2005  4.323543     1.043235   104.323543
2    Revenue    LSE  Jan 2006  3.096308     1.030963   107.553721
3    Revenue    LSE  Jan 2007  6.211666     1.062117   114.234599
4    Revenue     LD  Jan 2004       NaN     1.000000   100.000000
5    Revenue     LD  Jan 2005  3.521800     1.035218   103.521800
6    Revenue     LD  Jan 2006  2.741700     1.027417   106.360057
7    Revenue     LD  Jan 2007  3.350600     1.033506   109.923757

If you need it rounded to 4 digit precision, add .round(4) after the .mul(100):

stpassrev['index_value'] = stpassrev.groupby('Sector')['temp_factor'].cumprod().mul(100).round(4)

  ticket_cat Sector      Year    factor  temp_factor  index_value
0    Revenue    LSE  Jan 2004       NaN     1.000000     100.0000
1    Revenue    LSE  Jan 2005  4.323543     1.043235     104.3235
2    Revenue    LSE  Jan 2006  3.096308     1.030963     107.5537
3    Revenue    LSE  Jan 2007  6.211666     1.062117     114.2346
4    Revenue     LD  Jan 2004       NaN     1.000000     100.0000
5    Revenue     LD  Jan 2005  3.521800     1.035218     103.5218
6    Revenue     LD  Jan 2006  2.741700     1.027417     106.3601
7    Revenue     LD  Jan 2007  3.350600     1.033506     109.9238
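
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. The dataframe is rebuilt by hand from the sample rows in the question (an assumption about the real input's shape), and `fillna(0)` is used in place of `np.where` to build `temp_factor`, which should give the same multipliers.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the first row of each Sector has no factor.
stpassrev = pd.DataFrame({
    "ticket_cat": ["Revenue"] * 8,
    "Sector": ["LSE"] * 4 + ["LD"] * 4,
    "Year": ["Jan 2004", "Jan 2005", "Jan 2006", "Jan 2007"] * 2,
    "factor": [np.nan, 4.323542894, 3.096308080, 6.211666,
               np.nan, 3.5218, 2.7417, 3.3506],
})

# Growth multiplier per row: 1 where factor is missing, otherwise 1 + factor/100.
stpassrev["temp_factor"] = stpassrev["factor"].fillna(0).div(100).add(1)

# Running product within each Sector, scaled so that every series starts at 100.
stpassrev["index_value"] = (
    stpassrev.groupby("Sector")["temp_factor"].cumprod().mul(100).round(4)
)

print(stpassrev)
```

Grouping by `Sector` is what restarts the index at 100 for each series; without it the LD rows would continue compounding from the last LSE value.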

This got it! I see that recursion was unnecessary as the cumulative product has the effect of recalculating every new row anew rather than using the previous result. Very elegant! Thanks to you both.
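
A small aside on why this works: a cumulative product still carries the previous result forward, it just does so without an explicit loop. The sketch below, using only the LSE factors from the question, compares a hand-written recurrence against `cumprod`; the variable names are illustrative and not from the original code.

```python
import numpy as np
import pandas as pd

# LSE factors from the question; the first row has no factor.
factors = pd.Series([np.nan, 4.323542894, 3.096308080, 6.211666])

# Explicit recurrence: each value is the previous value grown by factor percent.
loop_index = [100.0]
for f in factors.iloc[1:]:
    loop_index.append(loop_index[-1] * (1 + f / 100))

# Vectorised equivalent: running product of the growth multipliers, scaled to 100.
cumprod_index = factors.fillna(0).div(100).add(1).cumprod().mul(100)

# The two approaches agree up to floating-point rounding.
print(np.allclose(loop_index, cumprod_index))
```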

Quang Hoang · Answer · 2019-11-18 16:51:25Z

Your problem is that you were performing division where you actually need multiplication. Your code can be simplified to just:

df['index_value'] = (df.factor.fillna(0)
                       .div(100).add(1)
                       .groupby(df['Sector'])
                       .cumprod()
                    )

Output:

  ticket_cat Sector      Year    factor  index_value
0    Revenue    LSE  Jan 2004       NaN     1.000000
1    Revenue    LSE  Jan 2005  4.323543     1.043235
2    Revenue    LSE  Jan 2006  3.096308     1.075537
3    Revenue    LSE  Jan 2007  6.211666     1.142346
4    Revenue     LD  Jan 2004       NaN     1.000000
5    Revenue     LD  Jan 2005  3.521800     1.035218
6    Revenue     LD  Jan 2006  2.741700     1.063601
7    Revenue     LD  Jan 2007  3.350600     1.099238
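
To make the division-versus-multiplication point concrete, here is a quick arithmetic check on the Jan 2006 LSE row, using the values shown in the question's table. It is only an illustration; the variable names are made up for this sketch.

```python
prev_index = 104.3235   # Jan 2005 index value for LSE, from the question's table
factor = 3.096308080    # Jan 2006 factor for LSE

# Formula used in the question's loop: divides by the previous index value.
divided = (100 + factor) / prev_index * 100

# Compounding formula: grows the previous index value by factor percent.
multiplied = prev_index * (100 + factor) / 100

print(divided, multiplied)
```

The first number lands near the incorrect 98.823 reported in the question, and the second near the expected 107.5537.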

edited Nov 18, 2019 at 16:51

answered Nov 18, 2019 at 15:06

2 Comments

r.ook Over a year ago

Besides the new_index_val not being *100, I think you got it.

Greg Williams Over a year ago

My apologies I haven't explained this clearly enough. I've edited the original question to show the expected values against the current incorrect output. The given answer replicates the incorrect/undesired output.
