
Hi! Is there a way to load large, (ideally) compressed, columnar data into NumPy arrays faster in Python? Having considered common solutions such as Pandas, Apache Parquet/Feather, and HDF5, I am struggling to find a suitable approach for my (time-series) problem.

As expected, representing my data as NumPy arrays yields by far the fastest execution time for search problems such as binary search, significantly outperforming the same analysis on a Pandas DataFrame. On the other hand, when I store my data as .npz files, loading them directly into NumPy arrays takes much longer than loading the same data into a DataFrame from columnar .parquet storage using the fastparquet engine. That route, however, requires me to call .to_numpy() on the resulting dataframe, which now again causes heavy delays in accessing the underlying NumPy representation of the DataFrame.

As mentioned above, one alternative I tried was to store the data in a format that can be loaded into a NumPy array without any intermediate conversion step. However, loading appears to be much slower when the data is stored as an .npz file (a table with > 10M records and > 10 columns) than when the same data is stored as a .parquet file.
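
For reference, a minimal sketch of the .npz round trip described above. The column names, dtypes, and row count are made up for illustration and are far smaller than the real table; compression via np.savez_compressed is also an assumption, since the question only says ".npz".

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the real table: a few equally long columns.
rng = np.random.default_rng(0)
n_rows = 100_000
columns = {
    "timestamp": np.arange(n_rows, dtype=np.int64),  # sorted, as in a time series
    "price": rng.random(n_rows),
    "volume": rng.integers(0, 1_000, size=n_rows),
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.npz")
    # One compressed array per column inside a single .npz archive.
    np.savez_compressed(path, **columns)

    # np.load on an .npz is lazy: each array is read and decompressed
    # only when it is accessed by key, so the cost shows up here.
    with np.load(path) as archive:
        loaded = {name: archive[name] for name in archive.files}

# Binary search on the sorted timestamp column.
idx = int(np.searchsorted(loaded["timestamp"], 42_000))
```

Timing the dictionary comprehension (rather than the np.load call itself) is what captures the actual decompression cost.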

  • Try saving to an .h5 file using the h5py library. See this example. (Commented Nov 19, 2023 at 2:15)

1 Answer


Actually, fastparquet supports loading your data into a dictionary of NumPy arrays, if you set these up beforehand. This is a "hidden" feature. If you give details of the dtype and size of the data you wish to load, this answer can be edited accordingly.

"to call .to_numpy() on the resulting dataframe, which now again causes heavy delays"

This is very surprising; it should normally be a copy-free view of the same underlying data.
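
One way to check whether a copy is happening is np.shares_memory. The sketch below uses a small, made-up frame (the column names "t" and "x" are hypothetical) and assumes plain NumPy-backed dtypes. A single column can usually be exposed without copying, whereas calling .to_numpy() on a whole frame with mixed dtypes has to build a new array of a common dtype, which might explain the delay you are seeing.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with mixed dtypes (int64 and float64).
df = pd.DataFrame(
    {
        "t": np.arange(5, dtype=np.int64),
        "x": np.linspace(0.0, 1.0, 5),
    }
)

# Per-column access: two calls that return views share the same buffer.
col = df["x"]
first = col.to_numpy()
second = col.to_numpy()
column_is_view = np.shares_memory(first, second)

# Whole-frame access: int64 and float64 must be unified into one 2-D
# float64 array, so a fresh buffer is allocated and filled.
table = df.to_numpy()
table_is_copy = not np.shares_memory(table, first)

print("column is a view:", column_is_view)
print("whole-frame result is a copy:", table_is_copy)
```

If the whole-frame call turns out to be the slow part, pulling out the columns you need one at a time (for example, into a dict of arrays) avoids that conversion.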
