Numpy: Use vectorization for loop while referring to previous row value?

Question

I have the following dataframe, to which I want to add a column named 'Value'. I want to use numpy to avoid a slow loop, while each row still refers to the previous row's value in the same column.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Product": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
        "Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
        "Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
        "Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
    }
)

  Product  Inbound  Outbound Is First?  Value
0       A      115        10       Yes    125
1       A      220        20        No    105
2       A      200        24        No     81
3       A      402        52        No     29
4       B      313        40       Yes    353
5       B      434        12        No    341
6       B      321        43        No    298
7       C      343        23       Yes    366
8       C      120        16        No    350

The formula for Value column in pseudocode is:

if ['Is First?'] = 'Yes' then [Value] = [Inbound] + [Outbound]
else [Value] = [Previous Value] - [Outbound]

The obvious way to create the Value column right now is a for loop that uses shift to refer to the previous row (which I somehow cannot get to work). But since I will be applying this to a very large dataset, I want to use a vectorized numpy approach instead.

for i in range(len(df)):
    if df.loc[i, "Is First?"] == "Yes":
        df.loc[i, "Value"] = df.loc[i, "Inbound"] + df.loc[i, "Outbound"]
    else:
        df.loc[i, "Value"] = df.loc[i, "Value"].shift(-1) + df.loc[i, "Outbound"]
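For reference, a minimal working version of the loop, following the pseudocode above (previous value minus Outbound), might look like the sketch below. It is only meant as a slow but correct baseline to check vectorized answers against. The helper name `value_loop` is made up for illustration, and the sketch assumes the first row is always a 'Yes' row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Inbound": [115, 220, 200, 402, 313, 434, 321, 343, 120],
        "Outbound": [10, 20, 24, 52, 40, 12, 43, 23, 16],
        "Is First?": ["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"],
    }
)


def value_loop(inbound, outbound, is_first):
    """Hypothetical helper: row-by-row reference implementation."""
    out = np.empty(len(inbound), dtype=np.int64)
    for i in range(len(inbound)):
        if is_first[i]:
            # A 'Yes' row starts a new run.
            out[i] = inbound[i] + outbound[i]
        else:
            # A 'No' row continues from the previous row's value.
            # Assumption: row 0 is a 'Yes' row, so out[i - 1] is always set.
            out[i] = out[i - 1] - outbound[i]
    return out


df["Value"] = value_loop(
    df["Inbound"].to_numpy(),
    df["Outbound"].to_numpy(),
    df["Is First?"].eq("Yes").to_numpy(),
)
```

Looping over plain numpy arrays rather than `df.loc[i, ...]` avoids most of the per-row pandas overhead, but it is still a Python-level loop.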

Does a "Yes" always coincide with a new product name?

– Chachni, Commented Aug 30, 2019 at 23:13

Andy L. · Accepted Answer · 2019-08-31 00:06:09Z

2

One way:
You may use np.subtract.accumulate with transform

s = df['Is First?'].eq('Yes').cumsum()
df['value'] = ((df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'), df.Outbound)
                                         .groupby(s)
                                         .transform(np.subtract.accumulate))

Out[1749]:
  Product  Inbound  Outbound Is First?  value
0       A      115        10       Yes    125
1       A      220        20        No    105
2       A      200        24        No     81
3       A      402        52        No     29
4       B      313        40       Yes    353
5       B      434        12        No    341
6       B      321        43        No    298
7       C      343        23       Yes    366
8       C      120        16        No    350
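To see why this works, it may help to look at `np.subtract.accumulate` on a single group in isolation: it keeps the first element and then subtracts each following element from the running result. A small sketch using the numbers of product A (the array name `group_a` is only for illustration):

```python
import numpy as np

# First entry is Inbound + Outbound of the 'Yes' row,
# the rest are the Outbound values of the following 'No' rows.
group_a = np.array([125, 20, 24, 52])

running = np.subtract.accumulate(group_a)
print(running.tolist())  # [125, 105, 81, 29]
```

This is the per-group computation that `transform` applies to every run that starts at a 'Yes' row.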

Another way:
Assign the value for the 'Yes' rows first. Create a group id s to use for the groupby. Within each group, shift Outbound and take its cumsum, then subtract that from the group's 'Yes' value. Finally, use the result to fill the NaNs.

df['value'] = (df.Inbound + df.Outbound).where(df['Is First?'].eq('Yes'))
s = df['Is First?'].eq('Yes').cumsum()
s1 = df.value.ffill() - df.Outbound.shift(-1).groupby(s).cumsum().shift()
df['value'] = df.value.fillna(s1)

Out[1671]:
  Product  Inbound  Outbound Is First?  value
0       A      115        10       Yes  125.0
1       A      220        20        No  105.0
2       A      200        24        No   81.0
3       A      402        52        No   29.0
4       B      313        40       Yes  353.0
5       B      434        12        No  341.0
6       B      321        43        No  298.0
7       C      343        23       Yes  366.0
8       C      120        16        No  350.0

edited Aug 31, 2019 at 0:06

answered Aug 30, 2019 at 22:22

Andy L.


2 Comments

Chachni Over a year ago

Could make it even shorter if a new product always corresponds to a Yes:

df.loc[df['Is First?'].eq('Yes'), 'Value'] = df.Inbound + df.Outbound
df.loc[df['Is First?'].eq('No'), 'Value'] = df.Value.ffill() - df.Outbound.shift(-1).groupby(df.Product).cumsum().shift()

Andy L. Over a year ago

Ah, I see what you mean. I did consider using df.Product for the groupby. However, I decided against it because the OP's logic never mentions it. It refers only to the values of 'Is First?', so I have to create s to use for the groupby.

Mark Wang · Accepted Answer · 2019-08-30 21:37:15Z

1

This is not a trivial task; the difficulty lies in the consecutive 'No' rows. It's necessary to group each run of consecutive 'No' rows together (here, with the 'Yes' row that starts it). The code below should do it:

col_sum = df.Inbound + df.Outbound
mask_no = df['Is First?'].eq('No')
mask_yes = df['Is First?'].eq('Yes')
consec_no = mask_yes.cumsum()
result = (col_sum.groupby(consec_no).transform('first')
          - df['Outbound'].where(mask_no, 0).groupby(consec_no).cumsum())
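The grouping key is the part that does the work here, so a quick look at what `mask_yes.cumsum()` produces for the sample data may be useful. A minimal sketch (the variable name `group_id` is only illustrative):

```python
import pandas as pd

is_first = pd.Series(["Yes", "No", "No", "No", "Yes", "No", "No", "Yes", "No"])

# Every 'Yes' bumps the counter, so each 'Yes' row and the 'No' rows
# that follow it share one group id.
group_id = is_first.eq("Yes").cumsum()
print(group_id.tolist())  # [1, 1, 1, 1, 2, 2, 2, 3, 3]
```

Because the key is derived from 'Is First?' alone, it does not depend on the Product column.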

answered Aug 30, 2019 at 21:37

Mark Wang


ansev · Accepted Answer · 2019-08-30 21:57:46Z

1

Use:

df.loc[df['Is First?'].eq('Yes'),'Value']=df['Inbound']+df['Outbound']
df.loc[~df['Is First?'].eq('Yes'),'Value']=df['Value'].fillna(0).shift().cumsum()-df.loc[~df['Is First?'].eq('Yes'),'Outbound'].cumsum()

edited Aug 30, 2019 at 21:57

answered Aug 30, 2019 at 21:45

ansev


1 Comment

Andy L. Over a year ago

This is wrong because the 1st cumsum is calculated across 'Yes' groups.
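A small sketch of what this comment seems to be pointing at, using the sample data: a cumulative sum over the whole column carries the starting values of earlier groups into later ones, whereas a cumulative sum restarted at each 'Yes' row does not. The series names below are illustrative only.

```python
import pandas as pd

is_first = pd.Series([True, False, False, False, True, False, False, True, False])
# Inbound + Outbound on 'Yes' rows, 0 elsewhere (as after fillna(0)).
start = pd.Series([125, 0, 0, 0, 353, 0, 0, 366, 0])

# Cumulative sum over the whole column: earlier groups leak into later ones.
global_carry = start.shift().cumsum()
# Cumulative sum restarted at every 'Yes' row: each group sees only its own start.
group_carry = start.groupby(is_first.cumsum()).cumsum()

# Row 5 is the first 'No' row of product B, whose start value is 353.
print(float(global_carry[5]), int(group_carry[5]))  # 478.0 353
```

With the global version, row 5 starts from 125 + 353 = 478 instead of 353, so the results only match the expected output within the first group.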

Paul Panzer · Accepted Answer · 2019-08-30 22:55:48Z

Annotated numpy code:

## 1. line up values to sum

ob = -df["Outbound"].values
# get yes indices
fi, = np.where(df["Is First?"].values == "Yes")
# insert yes formula at yes positions
ob[fi] = df["Inbound"].values[fi] - ob[fi]

## 2. calculate block sums and subtract each from the
## first element of the **next** block

ob[fi[1:]] -= np.add.reduceat(ob,fi)[:-1]
# now simply taking the cumsum will reset after each block
df["Value"] = ob.cumsum()

Result:

  Product  Inbound  Outbound Is First?  Value
0       A      115        10       Yes    125
1       A      220        20        No    105
2       A      200        24        No     81
3       A      402        52        No     29
4       B      313        40       Yes    353
5       B      434        12        No    341
6       B      321        43        No    298
7       C      343        23       Yes    366
8       C      120        16        No    350
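The reset trick in step 2 can be seen more easily on a tiny standalone array. The sketch below uses the first two blocks of the sample data; the names `steps` and `starts` are illustrative, and it assumes the first block starts at position 0.

```python
import numpy as np

# Per-row increments: the start value at each block start, then negative outflows.
steps = np.array([125, -20, -24, -52, 353, -12, -43])
starts = np.array([0, 4])  # positions of the block starts ('Yes' rows)

# Sum of every block, e.g. 125 - 20 - 24 - 52 = 29 for the first one.
block_sums = np.add.reduceat(steps, starts)

# Subtract each block's sum from the first element of the next block,
# so the running total is back at zero when the next block begins.
steps[starts[1:]] -= block_sums[:-1]

print(steps.cumsum().tolist())  # [125, 105, 81, 29, 353, 341, 298]
```

After the adjustment a single global `cumsum` is enough, which keeps the whole computation in numpy without any groupby.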
