
I searched online but found nothing on the problem I'm facing.

It seems that pandas.DataFrame operations on an index with timezone-aware dates are an order of magnitude slower than on regular datetimes.

Here are the IPython timings.

First, with standard datetimes:

import pandas as pd
import numpy as np

dates = pd.date_range('2010/01/01 00:00:00', '2010/12/31 00:00:00', freq='1T')
DF = pd.DataFrame(data=np.random.rand(len(dates)), index=dates, columns=["value"])

# compute timedeltas between dates
%timeit DF["temp"] = DF.index
%timeit DF["deltas"] = (DF["temp"] - DF["temp"].shift())

The results are:

1000 loops, best of 3: 1.13 ms per loop
100 loops, best of 3: 17.1 ms per loop

So far, so good.
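For what it's worth, everything in this first case stays in native NumPy dtypes, which is exactly why it is fast. A quick check (a sketch with a shorter range and the same column names as above):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2010-01-01", periods=1000, freq="min")
DF = pd.DataFrame(data=np.random.rand(len(dates)), index=dates, columns=["value"])

DF["temp"] = DF.index
DF["deltas"] = DF["temp"] - DF["temp"].shift()

# Both derived columns are native 64-bit dtypes, so the subtraction
# runs on the underlying int64 values rather than Python objects.
print(DF.dtypes)
```

`temp` comes out as datetime64[ns] and `deltas` as timedelta64[ns].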

Now, just adding timezone information:

import pandas as pd
import numpy as np

dates = pd.date_range('2010/01/01 00:00:00', '2010/12/31 00:00:00', freq='1T')
# NEW: filter dates to avoid DST problems
dates = dates[dates.hour > 2]  # avoids AmbiguousTimeError / NonExistentTimeError

DF = pd.DataFrame(data=np.random.rand(len(dates)), index=dates, columns=["value"])

# NEW: add timezone info
DF.index = DF.index.tz_localize(tz="America/New_York", ambiguous="infer")

# compute timedeltas between dates
%timeit DF["temp"] = DF.index
%timeit DF["deltas"] = (DF["temp"] - DF["temp"].shift())

And now the results are:

1 loops, best of 3: 5.43 s per loop
1 loops, best of 3: 16 s per loop

Why is that?
I really don't understand where the bottleneck is here...
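One workaround (a sketch, not from the original post): convert the tz-aware index to naive UTC before doing the arithmetic, so the subtraction runs on plain datetime64[ns] values instead of boxed objects:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2010-01-01", periods=1000, freq="min", tz="America/New_York")
DF = pd.DataFrame(data=np.random.rand(len(dates)), index=dates, columns=["value"])

# Normalize to UTC and drop the tz: the values become datetime64[ns],
# so the diff takes the fast native-dtype path.
naive = DF.index.tz_convert("UTC").tz_localize(None)
DF["deltas"] = pd.Series(naive, index=DF.index).diff()
```

Series.diff() gives the same deltas as subtracting a shifted copy, with NaT in the first row, and since the wall-clock offsets are applied once during the conversion, the deltas are identical to those of the tz-aware timestamps.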

For info (from conda list):

anaconda                  2.2.0                np19py34_0  
conda                     3.12.0                   py34_0  

numpy                     1.9.2                    py34_0  
pandas                    0.16.1               np19py34_0  
pytz                      2015.4                   py34_0  
scipy                     0.15.1               np19py34_0  
  • If I just set the timezone to UTC with tz_localize(tz=pytz.utc), the timing is 11.4 s, even though the dates didn't change at all from the standard datetimes. Commented May 21, 2015 at 22:42

1 Answer


This is a known issue, see here. Series of naive datetimes (i.e. NO timezone) are efficiently represented with a dtype of datetime64[ns]; calculations are done on the underlying int64 values and so are pretty fast. tz-aware Series are represented with object dtype, and those calculations are quite a bit slower.

It IS possible to fix this (see the referenced issue), to have a uniform tz-aware Series. Pull-requests are welcome!

In [9]: df = DataFrame({'datetime' : pd.date_range('20130101',periods=5), 'datetime_with_tz' : pd.date_range('20130101',periods=5,tz='US/Eastern')})

In [10]: df 
Out[10]: 
    datetime           datetime_with_tz
0 2013-01-01  2013-01-01 00:00:00-05:00
1 2013-01-02  2013-01-02 00:00:00-05:00
2 2013-01-03  2013-01-03 00:00:00-05:00
3 2013-01-04  2013-01-04 00:00:00-05:00
4 2013-01-05  2013-01-05 00:00:00-05:00

In [11]: df.dtypes
Out[11]: 
datetime            datetime64[ns]
datetime_with_tz            object
dtype: object