1

I am running the following code for several timestamps, each with close to one million records. It took more than an hour for a single date, and I have data for 35 dates in total.

Is there a way to optimize this code?

def median(a, b, c,d,e):
    I=[a,b,c,d,e]
    I.sort()
    return I[2]

for i in range(2, len(df['num'])-2):
    num_smooth= median(df['num'][i-1], df['num'][i-2], df['num'][i],
                       df['num'][i+1], df['num'][i+2])
    df.set_value(i,'num_smooth',num_smooth)
df['num_smooth'].fillna(df['num'], inplace=True)

...........................................
Remaining code
  • Have you tried profiling your code? Nothing in the code you posted should even take remotely close to 1 hour, for even 100mil+ records. Commented Oct 12, 2016 at 18:05
  • The code has a few other calculations as well. It was quick before I included this piece of code. Commented Oct 12, 2016 at 18:10

2 Answers

4

I'm guessing that your df is a Pandas DataFrame object. Pandas has built-in functionality to compute rolling statistics, including a rolling median. This functionality is available via the rolling method on Pandas Series and DataFrame objects.

>>> s = pd.Series(np.random.rand(10))
>>> s
0    0.500538
1    0.598179
2    0.747391
3    0.371498
4    0.244869
5    0.930303
6    0.327856
7    0.317395
8    0.190386
9    0.976148
dtype: float64
>>> s.rolling(window=5, center=True).median()
0         NaN
1         NaN
2    0.500538
3    0.598179
4    0.371498
5    0.327856
6    0.317395
7    0.327856
8         NaN
9         NaN
dtype: float64

See the Pandas documentation on Window Functions for more general information on using rolling and related functionality. As a general rule, when performance matters you should prefer built-in Pandas and NumPy functions and methods over explicit Python-level for loops, though as always, you should profile your solutions to be sure. On my machine, working with a df['num'] series containing one million random floats, the for-loop based solution takes approximately 129 seconds, while the rolling-based solution takes around 0.61 seconds, so using rolling speeds the code up by a factor of over 200.
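If you want to check the difference on your own machine, here is a rough timing sketch. The names loop_smooth and rolling_smooth are made up for illustration, the data is random, and the series is much smaller than one million rows so the loop finishes quickly. It also collects results in a plain list rather than calling df.set_value, so absolute timings will differ from the ones quoted above.

```python
import time

import numpy as np
import pandas as pd

n = 20_000  # assumption: far fewer rows than in the question, to keep the loop short
s = pd.Series(np.random.default_rng(0).random(n))


def loop_smooth(series):
    """5-point median using element-by-element Series indexing, as in the question."""
    smoothed = series.to_list()
    for i in range(2, len(series) - 2):
        window = [series[i - 2], series[i - 1], series[i],
                  series[i + 1], series[i + 2]]
        smoothed[i] = sorted(window)[2]
    return pd.Series(smoothed, index=series.index)


def rolling_smooth(series):
    """Same smoothing via the built-in rolling median; edges fall back to raw values."""
    return series.rolling(window=5, center=True).median().fillna(series)


start = time.perf_counter()
loop_result = loop_smooth(s)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
rolling_result = rolling_smooth(s)
rolling_seconds = time.perf_counter() - start

print(f"loop:    {loop_seconds:.4f} s")
print(f"rolling: {rolling_seconds:.4f} s")
```

A single perf_counter measurement is noisy; for anything more careful, repeat the runs (for example with the timeit module).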

So in your case,

df['num_smooth'] = df['num'].rolling(window=5, center=True).median()

along with the fillna step that you already have should give you something close to what you need.
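Putting the two steps together, a minimal end-to-end sketch might look like the following. The DataFrame here is a hypothetical stand-in with random data in a 'num' column, and the explicit loop at the end is only there to cross-check the result on a small input.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's data: one numeric column named 'num'.
rng = np.random.default_rng(0)
df = pd.DataFrame({"num": rng.random(1000)})

# Centered 5-point rolling median. The first two and last two rows come out
# as NaN because their windows are incomplete.
df["num_smooth"] = df["num"].rolling(window=5, center=True).median()

# Same edge handling as in the question: fall back to the raw value.
df["num_smooth"] = df["num_smooth"].fillna(df["num"])

# Cross-check against an explicit loop over a NumPy array.
nums = df["num"].to_numpy()
expected = nums.copy()
for i in range(2, len(nums) - 2):
    expected[i] = sorted(nums[i - 2 : i + 3])[2]

assert np.allclose(df["num_smooth"].to_numpy(), expected)
```

Assigning the result of fillna back to the column, rather than using inplace=True on a column selection, avoids the chained-assignment pitfalls that newer Pandas versions warn about.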

Note that the syntax for computing rolling statistics changed in Pandas 0.18, so you'll need at least version 0.18 to use the above code. For earlier versions of Pandas, look into the rolling_median function.


1 Comment

Yes, exactly. Was just about to post this.
0

A nice tool for profiling Python code performance line by line is kernprof, the command-line script that ships with the line_profiler package.
