Querying Python Pandas DataFrame with a Datetime index or column

Question

I'm new to the pandas package. I'm backtesting a strategy on ETFs, which requires running a lot of queries against pandas DataFrames.

Say I have two DataFrames, df and df1. The only difference is that df has a datetime index, while df1 has the timestamps as a column and an integer index:

In[104]: df.head()
Out[104]: 

                       high     low    open   close   volume  openInterest
2007-04-24 09:31:00  148.28  148.12  148.23  148.15  2304400        341400
2007-04-24 09:32:00  148.21  148.14  148.14  148.19  2753500        449100
2007-04-24 09:33:00  148.24  148.13  148.18  148.14  2863400        109900
2007-04-24 09:34:00  148.18  148.12  148.13  148.16  3118287        254887
2007-04-24 09:35:00  148.17  148.14  148.16  148.16  3202112         83825

In[105]: df1.head()
Out[105]: 

                dates    high     low    open   close   volume  openInterest
0 2007-04-24 09:31:00  148.28  148.12  148.23  148.15  2304400        341400
1 2007-04-24 09:32:00  148.21  148.14  148.14  148.19  2753500        449100
2 2007-04-24 09:33:00  148.24  148.13  148.18  148.14  2863400        109900
3 2007-04-24 09:34:00  148.18  148.12  148.13  148.16  3118287        254887
4 2007-04-24 09:35:00  148.17  148.14  148.16  148.16  3202112         83825
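For readers who want to reproduce the two layouts, here is a minimal sketch. It assumes synthetic minute bars (the values and the reduced column set are made up, not the poster's data); only the shape of `df` and `df1` follows the question.

```python
import pandas as pd

# Hypothetical minute bars; only the layout mirrors the question's data.
idx = pd.date_range("2007-04-24 09:31", periods=5, freq="min")
df = pd.DataFrame(
    {
        "high": [148.28, 148.21, 148.24, 148.18, 148.17],
        "low": [148.12, 148.14, 148.13, 148.12, 148.14],
        "close": [148.15, 148.19, 148.14, 148.16, 148.16],
    },
    index=idx,  # df: timestamps live in a DatetimeIndex
)

# df1: same rows, but the timestamps become a 'dates' column
# and the index falls back to the default integers 0..n-1.
df1 = df.rename_axis("dates").reset_index()
```

Going the other way, `df1.set_index("dates")` should give back a frame indexed by timestamp.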

So I tested the query speed a little:

In[100]: %timeit df1[(df1['dates'] >= '2015-11-17') & (df1['dates'] < '2015-11-18')]
%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
100 loops, best of 3: 4.67 ms per loop
100 loops, best of 3: 3.14 ms per loop
1 loop, best of 3: 259 ms per loop

To my surprise, the indexing logic built into pandas is actually the slowest:

df.loc['2015-11-17']

Does anyone know why that is? And are there any docs or blog posts about the most efficient ways to query a pandas DataFrame?

Steven G · Accepted Answer · 2016-11-30 21:14:14Z

4

If I were you, I would use the simpler method:

df['2015-11-17']

In my opinion this is more 'pandas logic' than using .loc[] for a single date. I am guessing it is also faster.

Testing on a minute OHLC DataFrame:

%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
%timeit df['2015-11-17']

100 loops, best of 3: 13.8 ms per loop
1 loop, best of 3: 1.39 s per loop
1000 loops, best of 3: 486 us per loop
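The timings above depend on the pandas version and the data, so it is worth re-measuring locally. The sketch below only checks that the query styles select the same rows; it assumes three days of synthetic minute data with a single made-up `close` column. It uses `.loc` for the partial-string lookup because, as far as I know, selecting rows with `df['2015-11-17']` was deprecated in pandas 1.x and removed in 2.0.

```python
import numpy as np
import pandas as pd

# Three days of synthetic minute data (random placeholder values).
idx = pd.date_range("2015-11-16", "2015-11-18 23:59", freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"close": rng.normal(100, 1, len(idx))}, index=idx)
df1 = df.rename_axis("dates").reset_index()

day = pd.Timestamp("2015-11-17")
next_day = pd.Timestamp("2015-11-18")

# 1) Boolean mask on a datetime column.
by_column_mask = df1[(df1["dates"] >= day) & (df1["dates"] < next_day)]
# 2) Boolean mask on the DatetimeIndex.
by_index_mask = df.loc[(df.index >= day) & (df.index < next_day)]
# 3) Partial string indexing: a day-level string selects every row of that day.
by_partial_string = df.loc["2015-11-17"]

# All three pick the same 1440 minute bars.
same_rows = by_index_mask.equals(by_partial_string)
```

In IPython, prefixing each of the three selections with `%timeit` reproduces the comparison on your own setup.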

edited Nov 30, 2016 at 21:14

answered Nov 30, 2016 at 21:03


5 Comments

qichao_he Over a year ago

Thanks, man! I tested it as well; the simplest method is the fastest, as it should be. Do you know where I can get the whole picture of the different ways to query with pandas?

Steven G Over a year ago

http://pandas.pydata.org/pandas-docs/stable/indexing.html is a start.

qichao_he Over a year ago

Haha, I'm actually looking at it right now. Again, thanks a lot!

Steven G Over a year ago

@qichao_he You should accept the answer if it is to your satisfaction. Have a good day.

Corley Brigman Over a year ago

I know this is closed, but be aware that AFAIK this only works for dates (not even sure why it works for that, to be honest). Normally, df[...] is used to select columns, and df.loc[...] is used to select rows. Pandas is column-oriented (i.e. each column is a Series, and the DataFrame is a collection of Series), so extracting a column is fast (it just returns the Series it already has), while extracting a row is slow (it has to pull data out of every column's Series and build a new one). Normally, using df[...] on row indexes doesn't work...
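To illustrate the column-versus-row distinction from the comment above, here is a small sketch on a made-up three-row frame (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Tiny hypothetical frame with a minute-level DatetimeIndex.
idx = pd.date_range("2015-11-17 09:31", periods=3, freq="min")
df = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0], "close": [1.5, 2.5, 3.5]},
    index=idx,
)

# df[...] with a column label returns that column as a Series.
col = df["close"]

# df.loc[...] selects by row label; label slices include both endpoints.
rows = df.loc["2015-11-17 09:32":"2015-11-17 09:33"]

# A single row comes back as a Series indexed by the column names.
one_row = df.loc[idx[0]]
```

Using `.loc` for rows and `df[...]` for columns keeps the intent unambiguous, whatever the index type.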
