
I have an HDF5 file that I would like to load into a list of Dask DataFrames. I have set this up in a loop, following an abbreviated version of the Dask pipeline approach. Here is the code:

import pandas as pd
from dask import compute, delayed
import dask.dataframe as dd
import os, h5py

@delayed
def load(d,k):
    ddf = dd.read_hdf(os.path.join(d,'Cleaned.h5'), key=k)
    return ddf

if __name__ == '__main__':      
    d = 'C:\Users\User\FileD'
    loaded = [load(d,'/DF'+str(i)) for i in range(1,10)]

    ddf_list = compute(*loaded)
    print(ddf_list[0].head(),ddf_list[0].compute().shape)

I get this error message:

C:\Python27\lib\site-packages\tables\group.py:1187: UserWarning: problems loading leaf ``/DF1/table``::

  HDF5 error back trace

  File "..\..\hdf5-1.8.18\src\H5Dio.c", line 173, in H5Dread
    can't read data
  File "..\..\hdf5-1.8.18\src\H5Dio.c", line 543, in H5D__read
    can't initialize I/O info
  File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 841, in H5D__chunk_io_init
    unable to create file chunk selections
  File "..\..\hdf5-1.8.18\src\H5Dchunk.c", line 1330, in H5D__create_chunk_file_map_hyper
    can't insert chunk into skip list
  File "..\..\hdf5-1.8.18\src\H5SL.c", line 1066, in H5SL_insert
    can't create new skip list node
  File "..\..\hdf5-1.8.18\src\H5SL.c", line 735, in H5SL_insert_common
    can't insert duplicate key

End of HDF5 error back trace

Problems reading the array data.

The leaf will become an ``UnImplemented`` node.
  % (self._g_join(childname), exc))

The message mentions a duplicate key. I iterated over the first 9 keys to test the code; in the loop, each iteration assembles a different key that I pass to dd.read_hdf. Across all iterations the filename stays the same - only the key changes.

I need to use dd.concat(list, axis=0, ...) to vertically concatenate the contents of the file. My approach was to load the DataFrames into a list first and then concatenate them.
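dd.concat mirrors pandas' pd.concat, so as a minimal illustration of the vertical stacking I'm after (using plain pandas here; the frames and column names are made up):

```python
import pandas as pd

# Two frames with the same columns, standing in for the per-key
# DataFrames read from the HDF5 file.
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# axis=0 stacks rows vertically; ignore_index renumbers the index.
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.shape)  # (4, 2)
```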

I have PyTables and h5py installed, and Dask version 0.14.3+2.

With Pandas 0.20.1, the equivalent loop does work:

for i in range(1,10):
    hdf = pd.HDFStore(os.path.join(d,'Cleaned.h5'),mode='r')
    df = hdf.get('/DF{}'.format(i))
    print df.shape
    hdf.close()

Is there a way I can load this HDF5 file into a list of Dask DataFrames? Or is there another approach to vertically concatenate them together?

1 Answer

Dask.dataframe is already lazy, so there is no need to use dask.delayed to make it lazier. You can just call dd.read_hdf repeatedly:

keys = ['/DF{}'.format(i) for i in range(1, 10)]
ddfs = [dd.read_hdf(os.path.join(d, 'Cleaned.h5'), key=k)
        for k in keys]

ddf = dd.concat(ddfs)

4 Comments

I had missed that. Thanks!
Is it possible to use mixed delayed and non-delayed functions in the same pipeline?
See these docs for how to convert between delayed values and dask.dataframes. There is no reason to nest lazy functions within lazy functions.
You can also pass a list of paths directly to one invocation of dd.read_hdf.

