I was benchmarking pandas DataFrame creation and found that it was more expensive than numpy ndarray creation.
Benchmark Code
from timeit import Timer
setup = """
import numpy as np
import pandas as pd
"""
numpy_code = """
data = np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')])
"""
pandas_code = """
df = pd.DataFrame(np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')]))
"""
print "Numpy", min(Timer(numpy_code, setup=setup).repeat(10, 10)) * 10**6, "micro-seconds"
print "Pandas", min(Timer(pandas_code, setup=setup).repeat(10, 10)) * 10**6, "micro-seconds"
The output is
Numpy 17.5073728315 micro-seconds
Pandas 1757.9817013 micro-seconds
I was wondering if someone could help me understand why creating a pandas DataFrame is so much more expensive than constructing the underlying ndarray. And if I am doing something wrong, how can I improve performance?
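For reference, a variant of the benchmark (a sketch along the same lines, with the same array size) that pre-builds the ndarray in setup, so that only the DataFrame wrapping itself is timed:

```python
from timeit import Timer

# Sketch: pre-build the structured array in setup so the timed statement
# measures only the cost of wrapping it in a DataFrame, not array creation.
setup = """
import numpy as np
import pandas as pd
data = np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')])
"""
wrap_code = "df = pd.DataFrame(data)"
best = min(Timer(wrap_code, setup=setup).repeat(10, 10)) * 10**6
print("Wrap only: %f micro-seconds" % best)
```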
System Details
pandas version: 0.12.0
numpy version: 1.9.0
Python 2.7.6 (32-bit) running on Windows 7
data is created as a raw numpy ndarray. pandas still has to transform it into a tabular structure, and that alone takes time, however small. Add to that the extra check for heterogeneity of data and you are already doing two extra steps beyond plain array creation. On my end, though, it's worth noting that even using 0.14.1 pandas, the difference is nearly the same as yours: numpy comes in at 3.49 microseconds versus pandas at 323 microseconds. As to exactly why, I also await a very detailed explanation. :)
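If the construction overhead matters because many small DataFrames are being built in a loop, one way to sidestep it (a sketch, with a hypothetical chunk count) is to keep the per-chunk work in plain numpy and pay the tabular-conversion cost a single time at the end:

```python
import numpy as np
import pandas as pd

# Sketch (n_chunks is hypothetical): accumulate raw structured arrays,
# concatenate them as ndarrays, and construct a DataFrame only once,
# instead of paying the DataFrame construction overhead per iteration.
n_chunks = 100
chunks = [np.zeros(shape=(360,), dtype=[('A', 'f4'), ('B', 'f4'), ('C', 'f4')])
          for _ in range(n_chunks)]
combined = np.concatenate(chunks)   # cheap: stays a raw ndarray
df = pd.DataFrame(combined)         # tabular conversion happens only once
print(df.shape)
```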