
I was benchmarking pandas DataFrame creation and found that it was more expensive than numpy ndarray creation.

Benchmark Code

from timeit import Timer
setup = """
import numpy as np
import pandas as pd
"""
numpy_code = """
data = np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')])
"""
pandas_code = """
df = pd.DataFrame(np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')]))
"""
print "Numpy", min(Timer(numpy_code, setup=setup).repeat(10, 10)) * 10 ** 6, "micro-seconds"
print "Pandas", min(Timer(pandas_code, setup=setup).repeat(10, 10)) * 10 ** 6, "micro-seconds"

The output is

Numpy 17.5073728315 micro-seconds
Pandas 1757.9817013 micro-seconds

Could someone help me understand why pandas DataFrame creation is so much more expensive than ndarray construction? And if I am doing something wrong, how can I improve performance?

System Details

pandas version: 0.12.0
numpy version: 1.9.0
Python 2.7.6 (32-bit) running on Windows 7
  • Of course it's more expensive. The test isn't even completely fair: data is created as a raw numpy ndarray, and pandas still has to transform it into a tabular structure (a fairer variant is sketched just after these comments). That alone takes more time, however small. Add to that the extra check for heterogeneity of the data and you are already doing two extra steps beyond plain array creation. On my end, though, it's worth noting that even on pandas 0.14.1 the difference is nearly the same as yours, with numpy scoring 3.23ms versus pandas at 323ms. Commented Oct 24, 2014 at 15:47
  • Scratch the above, numpy is at 3.49 microseconds versus pandas at 323 microseconds. As to exactly why, I also await a very detailed explanation. :) Commented Oct 24, 2014 at 16:05
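
For reference, a fairer variant of the benchmark along those lines, as a minimal sketch: the structured array is built once in setup, so only the ndarray-to-DataFrame conversion is timed.

from timeit import Timer

# Build the structured array once in setup; only the conversion is timed.
setup = """
import numpy as np
import pandas as pd
data = np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')])
"""
convert_code = "df = pd.DataFrame(data)"

best = min(Timer(convert_code, setup=setup).repeat(10, 10)) / 10  # seconds per call
print("DataFrame conversion: %.1f micro-seconds" % (best * 10 ** 6))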

1 Answer


For a completely homogeneously dtyped numpy array, the performance difference for creation will be quite minuscule: no copying is done, and the array is simply passed through.

However, for heterogeneously dtyped numpy arrays, the data IS segregated by dtype (which may involve copying, especially if your input has non-contiguous dtypes) into separate blocks, each holding a single dtype (as a numpy array).
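
For example, a minimal sketch (assuming numpy >= 1.11 for np.shares_memory) of how a mixed-dtype structured array ends up regrouped into per-dtype columns that no longer share the original buffer:

import numpy as np
import pandas as pd

# Hypothetical mixed-dtype structured array: float32 and int32 fields
rec = np.zeros(5, dtype=[('A', 'f4'), ('B', 'i4'), ('C', 'f4')])
df = pd.DataFrame(rec)

print(df.dtypes)                              # A/C float32, B int32
# The data was regrouped into per-dtype blocks, so the DataFrame's columns
# are expected not to share memory with the original structured array.
print(np.shares_memory(rec, df['A'].values))  # expected: False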

Other types of input trigger different amounts of checking (e.g. lists are inspected to determine whether they are 1-d, 2-d, etc.), and various checks relating to coercion of datetime-likes occur.
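
For example, a rough sketch of the cost of that inspection: handing in the same data as nested Python lists is expected to be noticeably slower than wrapping an existing ndarray.

from timeit import Timer

setup = """
import numpy as np
import pandas as pd
arr = np.zeros((360, 3))
lst = arr.tolist()
"""
# Total seconds for 100 constructions each; the list path has to inspect
# every element to infer the dtype and shape.
print(min(Timer("pd.DataFrame(arr)", setup=setup).repeat(5, 100)))
print(min(Timer("pd.DataFrame(lst)", setup=setup).repeat(5, 100)))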

The reason for this upfront dtype separation is simple: you can then perform operations that behave differently on different dtypes without separating them at run time (and without the corresponding slicing performance issues).
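
For example, a minimal sketch (assuming a pandas version new enough to have select_dtypes, i.e. 0.14.1+): numeric reductions can run against the float block alone without ever touching the object column.

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(5, dtype='f8'),
                   'y': np.arange(5, dtype='f8') * 2,
                   'label': list('abcde')})

print(df.dtypes)
# Only the numeric block participates; the object column is skipped outright.
print(df.select_dtypes(include=[np.number]).sum())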

To be very honest, this is a very slight performance hit to take in exchange for all of the attendant advantages of using a DataFrame, namely a consistent, intuitive API that handles null data and different dtypes intelligently.
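
For example, a quick sketch of the null handling:

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.sum())                               # 4.0 -- NaN is skipped by default
print(s.mean())                              # 2.0
print(np.array([1.0, np.nan, 3.0]).sum())    # nan -- plain numpy propagates it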

Homogeneous case: this involves NO copying

In [41]: %timeit np.ones((10000,100))
1000 loops, best of 3: 399 us per loop

In [42]: arr = np.ones((10000,100))

In [43]: %timeit DataFrame(arr)
10000 loops, best of 3: 65.9 us per loop
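
A minimal sketch to see the no-copy behaviour (very recent pandas with copy-on-write enabled may behave differently): mutate the source array and the change shows through the DataFrame.

import numpy as np
import pandas as pd

arr = np.ones((4, 3))
df = pd.DataFrame(arr)

arr[0, 0] = 99.0          # mutate the underlying ndarray in place
print(df.iloc[0, 0])      # 99.0 -- the frame views the same memory, no copy was made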

7 Comments

+1: This is very informative. I was thinking that heterogeneity was part of the big hit, and the second paragraph indeed confirms it. Totally agree with the last paragraph as well.
I updated my question with a homogeneously dtyped array. The difference in time to create an ndarray vs a DataFrame is still 2 orders of magnitude. I understand the convenience of DataFrame; I am trying to see if we can get that without compromising much on performance.
It may be possible to do a better job of converting a structured array (so that you take views of contiguous dtype blocks, not exactly sure). If this is a bottleneck, please investigate.
I updated my answer with a single dtyped conversion.
@goutham I am puzzled as to why microseconds matter here.
