
I have to process HDF5 files. Each of them contains data that can be loaded into a pandas DataFrame with 100 columns and almost 5E5 rows. Each HDF5 file weighs approximately 130 MB.

So I want to fetch the data from the HDF5 file, then apply some processing, and finally save the new data to a CSV file. In my case, the performance of the process is very important because I will have to repeat it many times.
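A minimal per-file version of that pipeline might look like the sketch below. The key name `"data"` and the function names are assumptions for illustration; use whatever key your files actually store the table under.

```python
import pandas as pd

def hdf5_to_csv(h5_path, csv_path, key="data"):
    # Load one HDF5 file into a DataFrame ("data" is a hypothetical key;
    # replace it with the key used in your files).
    df = pd.read_hdf(h5_path, key=key)
    # ... apply your processing here ...
    df.to_csv(csv_path, index=False)
```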

So far I have focused on pandas and Dask to get the job done. Dask is good for parallelization, and I should get good processing times with a stronger PC and more CPUs.

Have some of you already encountered this problem and found the best optimization?

  • Welcome to StackOverflow! It's difficult to provide specific guidance without more detail. Your question may be downvoted or closed because it's pretty open-ended. In general, I'd say that dask and pandas are good libraries, but converting a lot of data from HDF5 to CSV means moving from a compressed, binary storage format intended for high-volume data to a human-readable, inefficient storage format prone to encoding errors and other issues. If you have to do this, then you're probably off to the right start. But my only advice would be to try not to do this :) Good luck! Commented Feb 21, 2021 at 20:53
  • @Michael Delgado makes good points about file size and performance of HDF5 vs CSV. Another consideration: you will now have to track the HDF5 file AND exported CSV files. You should only do this if you HAVE to. Otherwise, you're better off writing code to read the HDF5 data in native format. Commented Feb 22, 2021 at 14:37
  • Thank you @Michael Delgado and @kcw78 for your comments. You are right. If some people wonder whether it could be interesting to work with CSV files rather than HDF5, your comments give good arguments against that option. If I didn't have to do this, I wouldn't convert HDF5 to CSV. Commented Feb 22, 2021 at 19:29
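For reference, reading the HDF5 data in its native format (as kcw78 suggests) takes only a few lines with h5py. The sketch below writes a tiny demo file first so it is self-contained; the dataset name "data" is an assumption — inspect `f.keys()` on your own files to find the real one.

```python
import h5py
import numpy as np
import pandas as pd

# Create a small demo file so the sketch is runnable; in practice you
# would open one of your existing HDF5 files instead.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("data", data=np.arange(6, dtype=float).reshape(2, 3))

# "data" is a hypothetical dataset name; list f.keys() to find yours.
with h5py.File("demo.h5", "r") as f:
    arr = f["data"][:]       # read the whole dataset into a NumPy array
df = pd.DataFrame(arr)       # wrap it in a DataFrame only if convenient
```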

1 Answer


As others have mentioned in the comments, unless you have to move it to CSV, I'd recommend keeping it in HDF5. However, below is a description of how you might do it if you do have to carry out the conversion.

It sounds like you have a function for loading the HDF5 file into a pandas DataFrame. I would suggest using Dask's delayed API to create a list of delayed pandas DataFrames, and then converting them into a Dask DataFrame. The snippet below is copied from the linked page, with an added line to save to CSV.

import dask.dataframe as dd
from dask.delayed import delayed

from my_custom_library import load  # your HDF5 -> pandas loader

filenames = ...

# Build a list of lazy pandas DataFrames, one per HDF5 file.
dfs = [delayed(load)(fn) for fn in filenames]

# Combine them into a single Dask DataFrame and write out as CSV.
df = dd.from_delayed(dfs)
df.to_csv(filename, **kwargs)

See dd.to_csv() documentation for info on options for saving to CSV.


1 Comment

Thanks @natemcintosh. Your suggestion to use Dask's delayed API seems relevant. I can treat my set of HDF5 files as a collection that can be loaded into a single Dask DataFrame with dd.from_delayed, and it is then more efficient to run df.apply on this single DataFrame: I get an 8% gain in processing time compared to processing the HDF5 files individually. Unfortunately my df.apply treatment returns a pandas.Series and I struggle to produce the CSV files, which might be done with the dd.to_delayed method. Anyway, the pandas-and-Dask option is good according to you and Michael Delgado. I will dig into it.
