1

Is there a better way (performance-wise) for doing the following loop in pandas (assuming df is a DataFrame)?

for i in range(len(df)):
    if df['signal'].iloc[i] == 0:   # if the signal is negative
        if df['position'].iloc[i - 1] - 0.02 < -1:   # if the row above - 0.02 < -1, set the value of the current row to -1
            df['position'].iloc[i] = -1
        else:   # otherwise subtract 0.02 from the value of the row above
            df['position'].iloc[i] = df['position'].iloc[i - 1] - 0.02
    elif df['signal'].iloc[i] == 1:     # if the signal is positive
        if df['position'].iloc[i - 1] + 0.02 > 1:     # if the row above + 0.02 > 1, set the value of the current row to 1
            df['position'].iloc[i] = 1
        else:   # otherwise add 0.02 to the value of the row above
            df['position'].iloc[i] = df['position'].iloc[i - 1] + 0.02

I would be grateful for any advice, because I have only just started with pandas and may well be missing something crucial.

Source CSV data:

Date,sp500,sp500 MA,UNRATE,UNRATE MA,signal,position
2000-01-01,,,4.0,4.191666666666665,1,0
2000-01-02,,,4.0,4.191666666666665,1,0
2000-01-03,102.93,95.02135,4.0,4.191666666666665,1,0
2000-01-04,98.91,95.0599,4.0,4.191666666666665,1,0
2000-01-05,99.08,95.11245000000001,4.0,4.191666666666665,1,0
2000-01-06,97.49,95.15450000000001,4.0,4.191666666666665,1,0
2000-01-07,103.15,95.21575000000001,4.0,4.191666666666665,1,0
2000-01-08,103.15,95.21575000000001,4.0,4.191666666666665,1,0
2000-01-09,103.15,95.21575000000001,4.0,4.191666666666665,1,0

Desired output:

Date,sp500,sp500 MA,UNRATE,UNRATE MA,signal,position
2000-01-01,,,4.0,4.191666666666665,1,0.02
2000-01-02,,,4.0,4.191666666666665,1,0.04
2000-01-03,102.93,95.02135,4.0,4.191666666666665,1,0.06
2000-01-04,98.91,95.0599,4.0,4.191666666666665,1,0.08
2000-01-05,99.08,95.11245000000001,4.0,4.191666666666665,1,0.1
2000-01-06,97.49,95.15450000000001,4.0,4.191666666666665,1,0.12
2000-01-07,103.15,95.21575000000001,4.0,4.191666666666665,1,0.14
2000-01-08,103.15,95.21575000000001,4.0,4.191666666666665,1,0.16
2000-01-09,103.15,95.21575000000001,4.0,4.191666666666665,1,0.18

Update: All the answers below (as of this writing) produce a constant position value of 0.02, which differs from my naive loop approach. In other words, I am looking for a solution that gives 0.02, 0.04, 0.06, 0.08, etc. for the position column.

11
  • 2
    if you're looping with pandas, you're almost always doing it wrong Commented Jul 25, 2018 at 15:44
  • @SuperStew yes, I had such gut feeling Commented Jul 25, 2018 at 15:46
  • 2
    Can you add example of input and desired output? Something like minimal reproducible example. Commented Jul 25, 2018 at 15:50
  • 1
    @varnie: what most people have missed is that the nth row of your output doesn't depend upon the n-1st row of your input, but the n-1st row of your output, and so can't be trivially decomposed into shifts. Commented Jul 25, 2018 at 18:53
  • 1
    If you have a working solution which contains simple loops, create a solution which only depends on numpy arrays like @Jonas Byström did and then use a compiler like Numba or Cython. eg. stackoverflow.com/a/50969037/4045774 Commented Jul 26, 2018 at 10:41
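
A minimal sketch of what the last comment suggests, assuming Numba is installed: the loop runs on plain NumPy arrays and is compiled with `njit`. The function name `clamped_positions`, its parameters, and the starting value of 0.0 are illustrative assumptions, not part of the question. If Numba is missing, the fallback decorator simply runs the same loop uncompiled.

```python
import numpy as np

try:
    from numba import njit  # optional dependency
except ImportError:
    def njit(func):
        """Fallback so the sketch still runs (uncompiled) without Numba."""
        return func


@njit
def clamped_positions(signal, step, start):
    # Running value that moves by `step` per row and is clamped to [-1, 1].
    out = np.empty(signal.shape[0], dtype=np.float64)
    prev = start
    for i in range(signal.shape[0]):
        if signal[i] == 1:
            prev = min(prev + step, 1.0)
        elif signal[i] == 0:
            prev = max(prev - step, -1.0)
        out[i] = prev
    return out


signal = np.array([1, 1, 1, 0], dtype=np.int64)
positions = clamped_positions(signal, 0.02, 0.0)
print(positions)
```

With a DataFrame, this could be applied as `df['position'] = clamped_positions(df['signal'].to_numpy(), 0.02, 0.0)`.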

4 Answers

2

Don't use a loop. Pandas specializes in vectorised operations, e.g. for signal == 0:

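import numpy as np

# position of the row above minus one step (NaN in the first row)
pos_shift = df['position'].shift() - 0.02
m1 = df['signal'] == 0   # negative signal
m2 = pos_shift < -1      # stepping down would fall below -1

df.loc[m1 & m2, 'position'] = -1
df['position'] = np.where(m1 & ~m2, pos_shift, df['position'])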

You can write something similar for signal == 1.
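
As the comments on this page point out, a shift-based update reads the input value of the row above, while the question's loop reads the value it has just written. A small sketch of the difference on toy data resembling the question's sample (the 5-row frame, the starting value of 0, and treating every non-1 signal as 0 are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame resembling the question's sample: signal is always 1, position starts at 0.
df = pd.DataFrame({"signal": [1] * 5, "position": [0.0] * 5})

# Shift-based update: reads the *input* position of the row above.
prev_input = df["position"].shift()
shifted = np.where(df["signal"] == 1,
                   np.minimum(prev_input + 0.02, 1.0),
                   np.maximum(prev_input - 0.02, -1.0))

# Recurrence: reads the *output* of the row above.
# Assumption: the running value is 0 before the first row.
running, prev = [], 0.0
for s in df["signal"]:
    prev = min(prev + 0.02, 1.0) if s == 1 else max(prev - 0.02, -1.0)
    running.append(round(prev, 2))

print(shifted)   # NaN in the first row, then a constant 0.02
print(running)   # [0.02, 0.04, 0.06, 0.08, 0.1]
```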


9 Comments

Thanks. It looks amazing, but I just noticed that the results your version produces are a bit different than I have from my initial code.
@varnie that's why it'd be really handy if you edited your question to include some sample input and output :)
@JonClements okay, tried to provide some input and output (updated my question).
@jpp from my tests it looks like your version produces the same position, 0.02, for all rows except the first one (the first one is NaN), but in my version it increases by a 0.02 step for each row.
@varnie, To be honest, I'd focus on the logic first rather than the result. Python / Pandas (usually) does what you tell it to do :). Is there a bit you don't understand? pd.Series.shift will have NaN in the first row, sure. But you can special case that if it's a problem.
1

Thanks for adding data and example output. First off, I am pretty sure you cannot vectorize this, as each calculation depends on the output of the previous one. So this is the best I was able to do.

Your method came in at around 0.116999 seconds on my machine.

This one came in at around 0.0039999 seconds.

It is not vectorized, but it gets a good speed increase because it is faster to build a plain list and assign it back to the data frame at the end than to write into the data frame row by row.

def myfunc(pos_pre, signal):
    pos = pos_pre  # keep the previous value if signal is neither 0 nor 1

    if signal == 0:  # if the signal is negative
        # subtract 0.02 from the previous value
        pos = pos_pre - 0.02
        if pos < -1:  # if the previous value - 0.02 < -1, set the current row to -1
            pos = -1

    elif signal == 1:  # if the signal is positive
        # add 0.02 to the previous value
        pos = pos_pre + 0.02
        if pos > 1:  # if the previous value + 0.02 > 1, set the current row to 1
            pos = 1

    return pos


# Set the first position value by hand: your method does not technically
# calculate it correctly, since there is no position minus 1,
# i.e. it will always be 0.02.
new_pos = [0.02]

# skip index zero since there is no position 0 minus 1
for i in range(1, len(df)):
    new_pos.append(myfunc(pos_pre=new_pos[i-1], signal=df['signal'].iloc[i]))

df['position'] = new_pos

Output:

df.position
0    0.02
1    0.04
2    0.06
3    0.08
4    0.10
5    0.12
6    0.14
7    0.16
8    0.18
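
The same list-based idea can also be written with `itertools.accumulate`, which carries the previous output into each step. This is only a sketch of an alternative spelling, not a faster method. The helper name `step`, the constant `STEP`, the seed value 0.0, and the choice to leave the position unchanged for other signals are all assumptions. The `initial` argument needs Python 3.8 or later.

```python
from itertools import accumulate

import pandas as pd

STEP = 0.02  # step size taken from the question


def step(prev, signal):
    """Move prev one STEP up (signal 1) or down (signal 0), clamped to [-1, 1]."""
    if signal == 1:
        return min(prev + STEP, 1.0)
    if signal == 0:
        return max(prev - STEP, -1.0)
    return prev  # assumption: any other signal leaves the position unchanged


df = pd.DataFrame({"signal": [1, 1, 1, 0, 0]})

# initial=0.0 is the assumed position before the first row; [1:] drops that seed.
positions = list(accumulate(df["signal"].tolist(), step, initial=0.0))[1:]
df["position"] = positions
print(df["position"].round(2).tolist())  # [0.02, 0.04, 0.06, 0.04, 0.02]
```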


0

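Yep. When looking for performance, you should always operate on the underlying numpy arrays:

signal = df['signal'].to_numpy()
# Work on a float copy: if 'position' was read as integers (as the sample
# CSV's 0 values would be), writing 0.02 into it would truncate to 0.
position = df['position'].to_numpy(dtype=float, copy=True)
for i in range(len(df)):
    # Note: at i == 0, position[i-1] wraps around to the last row,
    # just as iloc[i - 1] does in the question's loop.
    if signal[i] == 0:
        if position[i-1]-0.02 < -1:
            position[i] = -1
        else:
            position[i] = position[i-1]-0.02
    elif signal[i] == 1:
        if position[i-1]+0.02 > 1:
            position[i] = 1
        else:
            position[i] = position[i-1]+0.02
df['position'] = position  # write the result back to the DataFrame

You'll be surprised at the performance gain, often 10x or more.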

1 Comment

This still iterates in the same way the question does. The primary benefit of operating on numpy arrays is making use of vectorized operations.
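
On the point about vectorized operations: as long as the position never reaches -1 or 1, the recurrence reduces to a plain cumulative sum, which NumPy can do without a Python loop. The sketch below assumes exactly that and refuses to answer otherwise, because clamping the cumulative sum afterwards does not generally reproduce a running clamp. The function names, the default step, and the starting value of 0.0 are illustrative assumptions.

```python
import numpy as np


def positions_loop(signal, step=0.02, start=0.0):
    """Reference loop: running sum clamped to [-1, 1] at every row."""
    out = np.empty(len(signal), dtype=float)
    prev = start
    for i, s in enumerate(signal):
        if s == 1:
            prev = min(prev + step, 1.0)
        elif s == 0:
            prev = max(prev - step, -1.0)
        out[i] = prev
    return out


def positions_cumsum(signal, step=0.02, start=0.0):
    """Vectorized version; only valid while the result stays inside [-1, 1]."""
    signal = np.asarray(signal)
    deltas = np.where(signal == 1, step, np.where(signal == 0, -step, 0.0))
    out = start + np.cumsum(deltas)
    if out.size and np.abs(out).max() > 1.0:
        raise ValueError("position saturates at +/-1; fall back to the loop")
    return out


signal = np.array([1, 1, 1, 0, 1, 0, 0])
assert np.allclose(positions_loop(signal), positions_cumsum(signal))
```

The check is deliberately conservative: floating-point rounding can push a sum that sits exactly on the bound slightly past it, in which case the function raises and the loop should be used instead.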
0

There are most likely better ways, but this one should work too:

df['previous'] = df.signal.shift()

def get_signal_value(row):
    if row.signal == 0:
        compare = row.previous - 0.02
        if compare < -1:
            return -1
        else:
            return compare
    elif row.signal == 1:
        compare = row.previous + 0.02  # same 0.02 step as in the question
        if compare > 1:
            return 1
        else:
            return compare

# the function can be passed to apply directly; no lambda wrapper is needed
df['new_signal'] = df.apply(get_signal_value, axis=1)
