Pandas loop optimization

Question

Is there a better way (performance-wise) for doing the following loop in pandas (assuming df is a DataFrame)?

for i in range(len(df)):
    if df['signal'].iloc[i] == 0:   # signal is 0: step the position down
        if df['position'].iloc[i - 1] - 0.02 < -1:   # if the row above - 0.02 < -1, set the current row to -1
            df['position'].iloc[i] = -1
        else:   # otherwise subtract 0.02 from the value of the row above
            df['position'].iloc[i] = df['position'].iloc[i - 1] - 0.02
    elif df['signal'].iloc[i] == 1:     # signal is 1: step the position up
        if df['position'].iloc[i - 1] + 0.02 > 1:     # if the row above + 0.02 > 1, set the current row to 1
            df['position'].iloc[i] = 1
        else:   # otherwise add 0.02 to the value of the row above
            df['position'].iloc[i] = df['position'].iloc[i - 1] + 0.02

I would be grateful for any advice, because I have only just started with pandas and may well be missing something crucial.

Source CSV data:

Date,sp500,sp500 MA,UNRATE,UNRATE MA,signal,position
2000-01-01,,,4.0,4.191666666666665,1,0
2000-01-02,,,4.0,4.191666666666665,1,0
2000-01-03,102.93,95.02135,4.0,4.191666666666665,1,0
2000-01-04,98.91,95.0599,4.0,4.191666666666665,1,0
2000-01-05,99.08,95.11245000000001,4.0,4.191666666666665,1,0
2000-01-06,97.49,95.15450000000001,4.0,4.191666666666665,1,0
2000-01-07,103.15,95.21575000000001,4.0,4.191666666666665,1,0
2000-01-08,103.15,95.21575000000001,4.0,4.191666666666665,1,0
2000-01-09,103.15,95.21575000000001,4.0,4.191666666666665,1,0

Desired output:

Date,sp500,sp500 MA,UNRATE,UNRATE MA,signal,position
2000-01-01,,,4.0,4.191666666666665,1,0.02
2000-01-02,,,4.0,4.191666666666665,1,0.04
2000-01-03,102.93,95.02135,4.0,4.191666666666665,1,0.06
2000-01-04,98.91,95.0599,4.0,4.191666666666665,1,0.08
2000-01-05,99.08,95.11245000000001,4.0,4.191666666666665,1,0.1
2000-01-06,97.49,95.15450000000001,4.0,4.191666666666665,1,0.12
2000-01-07,103.15,95.21575000000001,4.0,4.191666666666665,1,0.14
2000-01-08,103.15,95.21575000000001,4.0,4.191666666666665,1,0.16
2000-01-09,103.15,95.21575000000001,4.0,4.191666666666665,1,0.18

Update: All the answers below (at the time of writing) produce a constant position value of 0.02, which differs from the output of my naive loop. In other words, I am looking for a solution that gives 0.02, 0.04, 0.06, 0.08, etc. for the position column.

If you're looping with pandas, you're almost always doing it wrong.
– SuperStew, Commented Jul 25, 2018 at 15:44
Can you add an example of input and desired output? Something like a minimal reproducible example.
– zipa, Commented Jul 25, 2018 at 15:50
@varnie: what most people have missed is that the nth row of your output doesn't depend upon the n-1st row of your input, but the n-1st row of your output, and so can't be trivially decomposed into shifts.
– DSM, Commented Jul 25, 2018 at 18:53
If you have a working solution which contains simple loops, create a solution which only depends on numpy arrays like @Jonas Byström did and then use a compiler like Numba or Cython, e.g. stackoverflow.com/a/50969037/4045774
– max9111, Commented Jul 26, 2018 at 10:41

jpp · Accepted Answer · 2018-07-25 16:08:07Z

2

Don't use a loop. Pandas specializes in vectorised operations, e.g. for signal == 0:

import numpy as np

pos_shift = df['position'].shift() - 0.02
m1 = df['signal'] == 0
m2 = pos_shift < -1

df.loc[m1 & m2, 'position'] = -1
df['position'] = np.where(m1 & ~m2, pos_shift, df['position'])

You can write something similar for signal == 1.
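As a minimal sketch of how both branches might be combined, the snippet below folds the two cases into one helper. The name `one_step` and the `step` parameter are hypothetical, not part of the answer. Note that it reads the previous row of the input column, not of the output, so it applies a single step per row and does not accumulate across rows.

```python
import numpy as np
import pandas as pd


def one_step(df: pd.DataFrame, step: float = 0.02) -> pd.Series:
    """One +/- step from the previous row's *input* position, clipped to [-1, 1]."""
    prev = df['position'].shift()  # NaN in the first row
    # -step where signal == 0, +step where signal == 1, NaN for anything else
    delta = np.select([df['signal'] == 0, df['signal'] == 1],
                      [-step, step], default=np.nan)
    stepped = (prev + delta).clip(-1, 1)
    # rows with any other signal value keep their current position
    return stepped.where(df['signal'].isin([0, 1]), df['position'])


demo = pd.DataFrame({'signal': [1, 1, 1, 0], 'position': [0, 0, 0, 0]})
result = one_step(demo)
```

On an all-zero `position` column this yields NaN in the first row and a constant ±0.02 afterwards, which matches the behaviour reported in the comments below.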

edited Jul 25, 2018 at 16:08

answered Jul 25, 2018 at 15:57

jpp

166k reputation, 37 gold badges, 301 silver badges, 362 bronze badges

9 Comments

varnie Over a year ago

Thanks. It looks amazing, but I just noticed that the results your version produces are a bit different than I have from my initial code.

Jon Clements Over a year ago

@varnie that's why it'd be really handy if you edited your question to include some sample input and output :)

varnie Over a year ago

@JonClements okay, tried to provide some input and output (updated my question).

varnie Over a year ago

@jpp from my tests it looks like your version produces the same position, 0.02, for all rows except the first one (the first one is NaN), but in my version it increases by a 0.02 step for each row.

jpp Over a year ago

@varnie, To be honest, I'd focus on the logic first rather than the result. Python / Pandas (usually) does what you tell it to do :). Is there a bit you don't understand? pd.Series.shift will have NaN in the first row, sure. But you can special case that if it's a problem.

ak_slick · Accepted Answer · 2018-07-25 23:46:13Z

Thanks for adding data and example output. First off, I am pretty sure you cannot vectorize this, because each calculation depends on the output of the previous one. So this is the best I was able to do.

Your method came in at around 0.116999 seconds on my machine.

This one came in at around 0.0039999 seconds.

It is not vectorized, but it gets a good speed increase because it is faster to build the result in a list and add it back to the data frame at the end.

def myfunc(pos_pre, signal):
    pos = pos_pre  # default: leave the position unchanged for any other signal
    if signal == 0:  # if the signal is negative
        # subtract 0.02 from the previous value, but never go below -1
        pos = pos_pre - 0.02
        if pos < -1:
            pos = -1

    elif signal == 1:
        # add 0.02 to the previous value, but never go above 1
        pos = pos_pre + 0.02
        if pos > 1:
            pos = 1

    return pos


# Set the first position value explicitly: your method does not technically
# calculate it correctly, since there is no "position minus 1" for row 0.
# I.e. it will always be 0.02.
new_pos = [0.02]

# skip index zero since there is no previous row for it
for i in range(1, len(df)):
    new_pos.append(myfunc(pos_pre=new_pos[i-1], signal=df['signal'].iloc[i]))

df['position'] = new_pos

Output:

df.position
0    0.02
1    0.04
2    0.06
3    0.08
4    0.10
5    0.12
6    0.14
7    0.16
8    0.18
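The same list-based idea can also be sketched with `itertools.accumulate`, which feeds each output back in as the next input. This is an alternative formulation, not the code timed above; `running_position`, `advance`, `step` and `start` are hypothetical names. Unlike the version above, it starts from an assumed position of 0 and applies the first row's signal instead of hard-coding 0.02.

```python
from itertools import accumulate

import pandas as pd


def running_position(signal, step=0.02, start=0.0):
    """Clamped running position: +step for signal 1, -step for signal 0."""
    def advance(pos, sig):
        if sig == 1:
            return min(pos + step, 1.0)
        if sig == 0:
            return max(pos - step, -1.0)
        return pos  # any other signal leaves the position unchanged

    # accumulate() passes each result back in as `pos`; drop the seed value
    return list(accumulate(signal, advance, initial=start))[1:]


df = pd.DataFrame({'signal': [1] * 9})
df['position'] = running_position(df['signal'].tolist())
```

Converting the column with `tolist()` first keeps the loop on plain Python numbers, which is the same reason the list-based version above is faster than indexing the DataFrame row by row.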

Jonas Byström · Accepted Answer · 2018-07-25 15:50:44Z

0

Yep. When looking for performance, you should always operate on the underlying numpy arrays:

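signal = df['signal'].to_numpy()
# Work on a float copy: the sample 'position' column is read as integers,
# and writing 0.02 into an integer array would truncate it to 0.
position = df['position'].to_numpy(dtype=float, copy=True)
for i in range(len(df)):  # as in the question, i == 0 reads position[-1], the last row
    if signal[i] == 0:
        if position[i-1]-0.02 < -1:
            position[i] = -1
        else:
            position[i] = position[i-1]-0.02
    elif signal[i] == 1:
        if position[i-1]+0.02 > 1:
            position[i] = 1
        else:
            position[i] = position[i-1]+0.02
df['position'] = position  # write the result back to the DataFrame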

You'll be surprised at the performance gain, often 10x or more.

answered Jul 25, 2018 at 15:50

Jonas Byström

26.4k reputation, 23 gold badges, 106 silver badges, 154 bronze badges

1 Comment

user3483203 Over a year ago

This still iterates in the same way the question does. The primary benefit of operating on numpy arrays is making use of vectorized operations.
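For the special case where the position never reaches the ±1 bounds, the recurrence reduces to a plain cumulative sum, which can be vectorized. The sketch below rests on that assumption and raises when it does not hold, because once the clamp is hit a cumulative sum and the loop diverge. `position_if_unclamped` and `step` are hypothetical names, and the starting position is assumed to be 0.

```python
import numpy as np


def position_if_unclamped(signal, step=0.02):
    """Vectorized running position, valid only while it stays within [-1, 1]."""
    signal = np.asarray(signal)
    # +step for signal 1, -step for signal 0, no change for anything else
    steps = np.select([signal == 1, signal == 0], [step, -step], default=0.0)
    pos = np.cumsum(steps)
    if pos.size and np.abs(pos).max() > 1:
        raise ValueError("position would reach the +/-1 bound; use a loop instead")
    return pos


pos = position_if_unclamped([1, 1, 1, 0])
```

With the nine-row sample from the question the bounds are far away, so this shortcut applies; for longer runs of the same signal the explicit loop is still needed.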

Ashish Acharya · Accepted Answer · 2018-07-25 15:54:20Z

0

There are most likely better ways, but this one should work too:

df['previous'] = df.position.shift()  # previous row's position (NaN in the first row)

def get_signal_value(row):
    if row.signal == 0:
        compare = row.previous - 0.02
        if compare < -1:
            return -1
        else:
            return compare
    elif row.signal == 1:
        compare = row.previous + 0.02
        if compare > 1:
            return 1
        else:
            return compare

df['new_signal'] = df.apply(get_signal_value, axis=1)

answered Jul 25, 2018 at 15:54

Ashish Acharya

3,409 reputation, 1 gold badge, 19 silver badges, 25 bronze badges
