I have the following loop, which takes more than 9 seconds for 10 000 iterations. My program has to call this function more than 1000 times, so I need help optimizing the "simu" function; as it stands, the code is too slow to be usable. Note that the daterange values below are only an example and can differ widely from one call to the next.

What takes the most time:

  • df.itertuples(['DATES'])
  • the loop itself, even when using an iterator
  • the if condition
  • df.index.get_loc to get the position of the date

Does anyone have an idea how to optimize this code?

import time
import numpy as np
import pandas as pd

def simu(nbprod, df, daterange):

    timer = time.time()
    mat = np.zeros((len(df), nbprod))

    iterator = ((i,j) for j in xrange(len(daterange)) for i in df.itertuples(['DATES']))

    for (i,j) in iterator:
        thedate = i[0]
        if (thedate >= daterange[j][0]) and (thedate <= daterange[j][1]):
            mat[df.index.get_loc(i[0])][j] = 1

    print time.time() - timer

    return mat


new_index = pd.date_range(start=pd.datetime(2014,1,1), periods=24*10000, freq='H')
df = pd.DataFrame(np.random.randn(len(new_index)), new_index)
df.index.name = 'DATES'

daterange = [[pd.datetime(2014,1,3), pd.datetime(2014,1,7)], [pd.datetime(2015,6,3), pd.datetime(2017,1,7)], [pd.datetime(2017,1,3), pd.datetime(2020,1,7)]]

### for 1 time
>>> simu(len(daterange), df, daterange)
9.43400001526

### for 3 times more
>>> simu(len(daterange)*3, df, daterange*3)
30.6919999123

>>> simu(len(daterange)*10, df, daterange*10)
92.2009999752
  • can you show what a sample daterange is? Commented Apr 1, 2014 at 21:43
  • yes sorry ! daterange = [[pd.datetime(2014,1,3), pd.datetime(2014,1,7)], [pd.datetime(2015,6,3), pd.datetime(2017,1,7)], [pd.datetime(2017,1,3), pd.datetime(2020,1,7)]] Commented Apr 1, 2014 at 21:44
  • Look here: pandas.pydata.org/pandas-docs/stable/enhancingperf.html <- I had a similar issue and by going to Cython I managed to go down from 132 seconds to 220 milliseconds for a large loop. Commented Apr 1, 2014 at 21:51

1 Answer

This returns a frame, which is IMHO more useful anyhow (if you want the underlying data, just use df.values). The run time will scale linearly with the length of daterange.

def simu2(df, daterange):

    mat = pd.DataFrame(0,index=df.index,columns=range(len(daterange)))
    for j, (d1,d2) in enumerate(daterange):
        result = df[(df.index>=d1)&(df.index<=d2)]
        mat.loc[result.index,j] = 1

    return mat


In [7]: result1 = simu2(df, daterange)

In [10]: result2 = simu(len(daterange), df, daterange)
5.7844748497

In [11]: (result1.values==result2).all()
Out[11]: True

In [12]: %timeit simu2(df, daterange)
10 loops, best of 3: 162 ms per loop
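
If a plain NumPy array is enough, the per-range loop can also be replaced by a single broadcast comparison. The following is only a minimal sketch, not something timed against the numbers above: the name simu_broadcast is hypothetical, and it assumes the index is a timezone-naive DatetimeIndex and that every entry of daterange is a (start, end) pair with inclusive bounds, as in the question.

```python
import numpy as np
import pandas as pd

def simu_broadcast(df, daterange):
    """Hypothetical variant: build the 0/1 matrix with one broadcast comparison.

    Assumes df.index is a timezone-naive DatetimeIndex and each element of
    daterange is a (start, end) pair; both bounds are treated as inclusive.
    """
    # Column vector of the index dates, shape (n_rows, 1)
    dates = df.index.values[:, None]
    # Row vectors of range bounds, shape (n_ranges,)
    starts = pd.to_datetime([d[0] for d in daterange]).values
    ends = pd.to_datetime([d[1] for d in daterange]).values
    # Broadcasting yields a boolean array of shape (n_rows, n_ranges)
    mask = (dates >= starts) & (dates <= ends)
    return mask.astype(np.int8)
```

The intermediate boolean array has len(df) * len(daterange) elements, so for a very large number of ranges the loop in simu2 may be easier on memory.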