
I'm new to the pandas package. I'm backtesting a strategy on ETFs, which requires running a lot of queries on pandas DataFrames.

Say I have these two DataFrames, df and df1. The only difference is that df has a datetime index, while df1 has the timestamps as a column and an integer index:

In[104]: df.head()
Out[104]: 

                       high     low    open   close   volume  openInterest
2007-04-24 09:31:00  148.28  148.12  148.23  148.15  2304400        341400
2007-04-24 09:32:00  148.21  148.14  148.14  148.19  2753500        449100
2007-04-24 09:33:00  148.24  148.13  148.18  148.14  2863400        109900
2007-04-24 09:34:00  148.18  148.12  148.13  148.16  3118287        254887
2007-04-24 09:35:00  148.17  148.14  148.16  148.16  3202112         83825

In[105]: df1.head()
Out[105]: 

                dates    high     low    open   close   volume  openInterest
0 2007-04-24 09:31:00  148.28  148.12  148.23  148.15  2304400        341400
1 2007-04-24 09:32:00  148.21  148.14  148.14  148.19  2753500        449100
2 2007-04-24 09:33:00  148.24  148.13  148.18  148.14  2863400        109900
3 2007-04-24 09:34:00  148.18  148.12  148.13  148.16  3118287        254887
4 2007-04-24 09:35:00  148.17  148.14  148.16  148.16  3202112         83825

So I tested the query speed a little:

In[100]: %timeit df1[(df1['dates'] >= '2015-11-17') & (df1['dates'] < '2015-11-18')]
%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
100 loops, best of 3: 4.67 ms per loop
100 loops, best of 3: 3.14 ms per loop
1 loop, best of 3: 259 ms per loop

To my surprise, the lookup that relies on pandas' built-in date logic is actually the slowest:

df.loc['2015-11-17']

Does anyone know why that is? And are there any documents or blog posts about the most efficient ways to query a pandas DataFrame?

1 Answer


If I were you, I would use the simpler method:

df['2015-11-17']  

In my opinion this is more 'pandas logic' than using .loc[] for a single date. I am guessing it is also faster.

Testing on a minute OHLC DataFrame:

%timeit df.loc[(df.index >= '2015-11-17') & (df.index < '2015-11-18')]
%timeit df.loc['2015-11-17']
%timeit df['2015-11-17']

100 loops, best of 3: 13.8 ms per loop
1 loop, best of 3: 1.39 s per loop
1000 loops, best of 3: 486 us per loop
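
For anyone who wants to reproduce this kind of comparison without the original data, here is a minimal sketch. The minute bars are synthetic and the names `idx`, `rng`, `start`, and `end` are my own, not from the question. As far as I know, newer pandas releases no longer accept a date string in plain df[...] for row selection, so the sketch sticks to .loc and boolean masks. Timings will vary with pandas version and data size, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic minute bars (hypothetical data, not the original ETF dataset).
idx = pd.date_range("2015-11-16 09:31", "2015-11-18 16:00", freq="min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"close": rng.normal(148.0, 0.1, len(idx))}, index=idx)

# Same data, but with the timestamps in a 'dates' column and an integer index.
df1 = df.reset_index().rename(columns={"index": "dates"})

start, end = pd.Timestamp("2015-11-17"), pd.Timestamp("2015-11-18")

# Three ways to select a single day.
by_column = df1[(df1["dates"] >= start) & (df1["dates"] < end)]
by_index_mask = df.loc[(df.index >= start) & (df.index < end)]
by_partial_string = df.loc["2015-11-17"]

# All three should return the same rows.
assert by_index_mask.index.equals(by_partial_string.index)
assert np.array_equal(by_column["close"].to_numpy(),
                      by_partial_string["close"].to_numpy())

# Rough timing of each approach; absolute numbers depend on the machine.
candidates = {
    "column mask": lambda: df1[(df1["dates"] >= start) & (df1["dates"] < end)],
    "index mask": lambda: df.loc[(df.index >= start) & (df.index < end)],
    "partial string": lambda: df.loc["2015-11-17"],
}
for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=20) / 20
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Checking that the selections agree before timing them is worth the extra two lines, since a fast query that returns different rows is not a fair comparison.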

5 Comments

Thanks, man! I tested it as well, the simplest method is the fastest, as it should be. Do you know where I can get a whole picture about different methods of querying with Pandas?
haha I'm actually looking at it right now, again thanks a lot!
@qichao_he You should accept the answer if it is to your satisfaction. Have a good day.
I know this is closed, but be aware that AFAIK this only works for dates (not even sure why it works for that, to be honest). Normally, df[...] is used to select columns, and df.loc[...] is used to select rows. Pandas is column-oriented (e.g. each column is a series, and the dataframe is an array of series), so extracting a column is fast (just return the series I already have), while extracting a row is slow (have to extract data out of all my collected series and create a new one). Normally, using df[...] on row indexes doesn't work...
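
To illustrate the column/row distinction described in the comment above, here is a small sketch. The two-row frame and its labels "a" and "b" are made up for the example.

```python
import pandas as pd

# Tiny frame with string row labels (hypothetical values).
df = pd.DataFrame(
    {"high": [148.28, 148.21], "low": [148.12, 148.14]},
    index=["a", "b"],
)

col = df["high"]    # df[...] selects a column and returns it as a Series
row = df.loc["a"]   # df.loc[...] selects a row, assembled across the columns

# Using df[...] with a row label fails, because "a" is not a column name.
try:
    df["a"]
    row_via_getitem = True
except KeyError:
    row_via_getitem = False

print(col.name, row.name, row_via_getitem)
```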
