Optimization of for loop in python

Question

I am running the following code for different timestamps, each of which has close to one million records. It took more than an hour for a single date, and I have data for 35 dates in total.

Is there a way to optimize this code?

def median(a, b, c,d,e):
    I=[a,b,c,d,e]
    I.sort()
    return I[2]

for i in range(2, len(df['num'])-2):
    num_smooth= median(df['num'][i-1], df['num'][i-2], df['num'][i],
                       df['num'][i+1], df['num'][i+2])
    df.set_value(i,'num_smooth',num_smooth)
df['num_smooth'].fillna(df['num'], inplace=True)

...........................................
Remaining code

Have you tried profiling your code? Nothing in the code you posted should take anywhere close to 1 hour, even for 100 million+ records.
– tcooc, Commented Oct 12, 2016 at 18:05
The code also has a few other calculations. It was quick before I included this piece of code.
– Prasad, Commented Oct 12, 2016 at 18:10

Mark Dickinson · Accepted Answer · 2016-10-12 19:23:56Z

I'm guessing that your df is a Pandas DataFrame object. Pandas has built-in functionality to compute rolling statistics, including a rolling median. This functionality is available via the rolling method on Pandas Series and DataFrame objects.

>>> s = pd.Series(np.random.rand(10))
>>> s
0    0.500538
1    0.598179
2    0.747391
3    0.371498
4    0.244869
5    0.930303
6    0.327856
7    0.317395
8    0.190386
9    0.976148
dtype: float64
>>> s.rolling(window=5, center=True).median()
0         NaN
1         NaN
2    0.500538
3    0.598179
4    0.371498
5    0.327856
6    0.317395
7    0.327856
8         NaN
9         NaN
dtype: float64

See the Pandas documentation on Window Functions for more general information on using rolling and related functionality. As a general rule, when performance matters you should prefer built-in Pandas and NumPy functions and methods over explicit Python-level for loops, though as always, you should profile your solutions to be sure. On my machine, working with a df['num'] series containing one million random floats, the for-loop based solution takes approximately 129 seconds, while the rolling-based solution takes around 0.61 seconds, so using rolling speeds the code up by a factor of over 200.
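To make that comparison concrete, here is a minimal sketch that checks the two approaches agree and times each once. The loop is a simplified stand-in for the question's code (it works on a NumPy array rather than calling set_value), and the helper names and the array size are assumptions chosen so the script finishes quickly. Timings will differ from the figures quoted above.

```python
import timeit

import numpy as np
import pandas as pd


def loop_median(values):
    """Five-point median via an explicit Python loop (interior points only)."""
    out = np.full(len(values), np.nan)
    for i in range(2, len(values) - 2):
        # Sort the five neighbours and take the middle one.
        out[i] = sorted(values[i - 2:i + 3])[2]
    return out


def rolling_median(values):
    """Same statistic using the built-in centered rolling window."""
    return pd.Series(values).rolling(window=5, center=True).median().to_numpy()


rng = np.random.default_rng(0)
data = rng.random(20_000)  # hypothetical size, small enough to run quickly

# Both approaches leave NaN at the two points on each edge.
same = np.allclose(loop_median(data), rolling_median(data), equal_nan=True)

loop_time = timeit.timeit(lambda: loop_median(data), number=1)
rolling_time = timeit.timeit(lambda: rolling_median(data), number=1)
print(f"same={same} loop={loop_time:.4f}s rolling={rolling_time:.4f}s")
```

A single run per approach is only a rough measurement; for anything you intend to rely on, repeat the timing several times on data of realistic size.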

So in your case,

df['num_smooth'] = df['num'].rolling(window=5, center=True).median()

along with the fillna step that you already have should give you something close to what you need.
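Putting the two steps together on a tiny frame, as a sketch: the sample values are made up for illustration, and the fillna result is assigned back to the column rather than using inplace=True, on the assumption that in-place modification through a chained column selection may not update the frame in newer Pandas versions.

```python
import pandas as pd

# Hypothetical sample data with one obvious spike at index 2.
df = pd.DataFrame({"num": [1.0, 2.0, 100.0, 3.0, 4.0, 5.0, 6.0]})

# Centered five-point rolling median; the first and last two rows are NaN
# because their windows are incomplete.
df["num_smooth"] = df["num"].rolling(window=5, center=True).median()

# Fall back to the raw value wherever the rolling median is undefined.
df["num_smooth"] = df["num_smooth"].fillna(df["num"])

print(df)
```

The spike at index 2 is replaced by the median of its neighbourhood, while the edge rows keep their original values.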

Note that the syntax for computing rolling statistics changed in Pandas 0.18, so you'll need at least version 0.18 to use the above code. For earlier versions of Pandas, look into the rolling_median function.

asc11 · Answer · 2016-10-12 18:56:31Z

A nice tool for profiling Python code performance line by line is kernprof.
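As a rough sketch of how that might look for the loop in the question: kernprof comes with the line_profiler package and, when run with -l, makes a profile decorator available as a builtin. The file name smooth_profile.py, the smooth function, and the sample size below are all hypothetical, and the fallback lets the script run under plain python as well.

```python
# smooth_profile.py (hypothetical file name)
import builtins

import pandas as pd

# `kernprof -l` injects `profile` into builtins; otherwise use a no-op
# decorator so the script still runs without the profiler.
profile = getattr(builtins, "profile", lambda func: func)


def median(a, b, c, d, e):
    values = [a, b, c, d, e]
    values.sort()
    return values[2]


@profile
def smooth(df):
    """Five-point median smoothing with an explicit loop (the slow version)."""
    num = df["num"]
    out = num.copy()  # edge rows keep their original values
    for i in range(2, len(num) - 2):
        out.iloc[i] = median(num.iloc[i - 2], num.iloc[i - 1], num.iloc[i],
                             num.iloc[i + 1], num.iloc[i + 2])
    return out


if __name__ == "__main__":
    frame = pd.DataFrame({"num": [float(i % 17) for i in range(5_000)]})
    smooth(frame)
```

Running `kernprof -l -v smooth_profile.py` should then print per-line timings for smooth, which shows whether the time goes into the element lookups, the median call, or the assignment.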

