Is it possible to load partial chunks of a DataArray (stored as a single netCDF file) from disk into memory (i.e. not load the whole DataArray at once), but without using dask-backed DataArrays?

The issue is that I'm using dask as my cluster scheduler to submit jobs, and within those jobs I want to page a DataArray into memory from disk in small pieces. Dask unfortunately does not like nested dask schedulers, so loading that DataArray with da = xr.open_dataarray(file, chunks={'time': 1000}) doesn't work (dask throws nested daemonic process errors).

Ideally, I'd like to do something like this - without having the whole dataarray loaded into memory, but only the relevant pieces:

import xarray as xr

da = xr.open_dataarray(my_file)  # lazily open the file
for t in range(0, len(da), 1000):
    da_actual = da[t:t+1000].load()  # materialize only this slice into memory
    # do some compute with da_actual

Any pointers or ideas on how to achieve this would be appreciated.

1 Answer

Wrapping this in dask.delayed might help:

import dask
import xarray as xr

@dask.delayed
def custom_array_func(my_file):
    da = xr.open_dataarray(my_file)  # lazily open the file
    for t in range(0, len(da), 1000):
        da_actual = da[t:t+1000].load()  # materialize only this slice into memory
        # do some compute with da_actual
    return final_result  # or return None if nothing is needed

[computed_results] = dask.compute([custom_array_func(my_file) for my_file in list_of_files])
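
If you are already submitting work to the cluster through a dask.distributed Client, a similar pattern can be expressed with client.submit instead of delayed. The sketch below is only an illustration of that idea; process_file and list_of_files are placeholder names, and the 1000-element chunk size mirrors the loop above:

import xarray as xr
from dask.distributed import Client

def process_file(my_file, chunk=1000):
    # Without chunks=..., xarray keeps the variable lazy, so each sliced
    # .load() call reads only that portion of the file from disk.
    da = xr.open_dataarray(my_file)
    results = []
    for t in range(0, len(da), chunk):
        da_actual = da[t:t + chunk].load()  # materialize only this slice
        results.append(float(da_actual.mean()))  # placeholder computation
    return results

client = Client()  # or Client('address-of-existing-scheduler')
futures = [client.submit(process_file, f) for f in list_of_files]
computed_results = client.gather(futures)

Either way, the per-file function runs as a single task on a worker and never builds a dask collection inside the job, which is what triggered the nested-scheduler problem in the first place.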