7

I have a calculation that expects a pandas dataframe as input. I'd like to run this calculation on data stored in a netCDF file that expands to 51GB - currently I've been opening the file with xarray.open_dataset and using chunks (my understanding is that this opened file is actually a dask array, so only loads chunks of data into memory at a time). However, I don't seem to be able to take advantage of this lazy loading, since I have to convert the xarray data into a pandas dataframe in order to run my calculation - and my understanding is that at that point all the data is loaded into memory (which is bad).

So I guess long story short, my question is: how can I get from an xarray dataset to a pandas dataframe without any intermediate steps that load my entire data into memory? I've seen dask work with pandas.read_csv, and I see it work with xarray, but I'm not sure how to convert an already opened netCDF xarray dataset to a pandas dataframe in chunks.

Thanks and sorry for the vague question!

1 Answer 1

6

This is a good question. This should be doable, but I'm not quite sure what the right approach is.

Ideally, we could simply implement a xarray.Dataset.to_dask_dataframe() method. But there are several challenges here -- the biggest one being that dask currently does not support dataframes with a MultiIndex.

Alternatively, you might want to construct a list of dask.Delayed objects holding pandas.DataFrames for each chunk of the xarray.Dataset. Toward this end, it would be nice if xarray had something like dask.array's to_delayed method for converting a Dataset into an array of delayed datasets, which you could then lazily convert into DataFrame objects and do your computation.

I encourage you to open an issue on either the dask or xarray GitHub pages to discuss, especially if you might be interested in contributing code. EDIT: You can find that issue here.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.