I have a dask-backed xarray `DataArray` of about 150k x 90k elements with a chunk size of 8192 x 8192. I am working on a Windows virtual machine with 100 GB of RAM and 16 cores.
I want to plot it using the Datashader package. From reading the docs, Datashader does support dask-backed xarrays. I am a beginner with Dask, but my (limited) understanding is that since the array is chunked, Datashader should only need to read one chunk at a time (or something to that effect) and do the required processing to produce the plot, so it never has to load the entire array into memory. Moreover, it should be able to leverage parallelism to speed the process up.
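For scale, here is the back-of-the-envelope arithmetic behind why I expected chunk-at-a-time processing to fit comfortably in RAM (assuming float64, i.e. 8 bytes per element):

```python
# Memory arithmetic for the array in question, assuming float64 (8 bytes/element)
full_bytes = 150_000 * 90_000 * 8   # the whole array
chunk_bytes = 8192 * 8192 * 8       # a single 8192 x 8192 chunk

print(full_bytes / 1e9)    # 108.0  -> ~108 GB, larger than the 100 GB of RAM
print(chunk_bytes / 1e9)   # ~0.54  -> one chunk easily fits in memory
```

So fully materializing the array cannot work on this machine, but any one chunk is tiny by comparison.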
However, when I try to create a plot, after about 15 seconds the RAM usage starts increasing steadily (with CPU > 95%) until I manually interrupt the code once it reaches about 90 GB. Part of the error message after interrupting is below.
> KeyboardInterrupt Traceback (most recent call
> last) Cell In[4], line 1
> ----> 1 tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))
>
> File
> <PATH>\lib\site-packages\datashader\transfer_functions\__init__.py:717,
> in shade(agg, cmap, color_key, how, alpha, min_alpha, span, name,
> color_baseline, rescale_discrete_levels)
> 713 return _apply_discrete_colorkey(
> 714 agg, color_key, alpha, name, color_baseline
> 715 )
> 716 else:
> --> 717 return _interpolate(agg, cmap, how, alpha, span, min_alpha, name,
> 718 rescale_discrete_levels)
> 719 elif agg.ndim == 3:
> 720 return _colorize(agg, color_key, how, alpha, span, min_alpha, name, color_baseline,
> 721 rescale_discrete_levels)
>
> File
> <PATH>\lib\site-packages\datashader\transfer_functions\__init__.py:256,
> in _interpolate(agg, cmap, how, alpha, span, min_alpha, name,
> rescale_discrete_levels)
> 254 data = agg.data
> 255 if isinstance(data, da.Array):
> --> 256 data = data.compute()
> 257 else:
> 258 data = data.copy()
From the above error message, it seems like `compute` is being run on the xarray, which would perhaps explain why the RAM usage grows so much. If that is the case, I am confused about why `compute` is being run, since I thought the whole point of using dask-backed xarrays was to avoid reading the entire array into memory. I would have thought it should be possible to create this plot even on a regular computer with 16 GB of RAM, with the main limiting factor being the chunk size plus some working space for dask/datashader to use in the background.
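To check my mental model at a smaller scale, here is a dask-only sketch (no Datashader) of what I expected to happen: a reduction over a chunked array is evaluated chunk by chunk, and calling `.compute()` on the *reduced* result only materializes that small result, never the full array.

```python
import dask.array as da

# A lazily evaluated chunked array: nothing is allocated yet.
arr = da.random.random((4000, 4000), chunks=(1000, 1000))
print(arr.nbytes / 1e9)   # 0.128 -> size in GB *if* fully materialized

# A reduction stays lazy until .compute(); dask then processes the
# chunks one at a time and only the small result ends up in memory.
small = arr.mean(axis=0)  # lazy, shape (4000,)
result = small.compute()  # a plain numpy array
print(result.nbytes)      # 32000 bytes, not 0.128 GB
```

My assumption was that Datashader's aggregation would behave like the reduction above, which is why the memory growth surprises me.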
I have almost certainly misunderstood some things. I would much appreciate it if anyone could explain what is happening here and how I can go about plotting this array.
Minimal reproducible example
# import libraries
import numpy as np
import dask.array as da
import datashader as ds
from datashader import transfer_functions as tf
import xarray as xr
# create large dask array
dask_array = da.random.random((100000, 100000), chunks=(1000, 1000))
# convert to dasked xarray
dask_xarray = xr.DataArray(
dask_array,
dims=["x", "y"],
coords={"x": np.arange(100000), "y": np.arange(100000)},
name="example_data" # Name of the data variable
)
# create plot using Datashader
tf.shade(ds.Canvas(plot_height=300, plot_width=300).raster(dask_xarray))
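For reference, the chunk-at-a-time processing I had in mind looks roughly like this hand-rolled numpy sketch (this is my own illustration of the idea, not Datashader's actual implementation): aggregate a large grid onto a small canvas one row block at a time, so only one block is ever in memory.

```python
import numpy as np

# Hypothetical sketch of chunk-wise aggregation: a 4000 x 4000 "source"
# reduced onto a 4 x 4 "canvas", reading one 1000-row block at a time.
src_rows, src_cols = 4000, 4000
block = 1000                  # rows per chunk; each block maps to one canvas row
out = np.zeros((4, 4))        # the tiny canvas

for i in range(0, src_rows, block):
    # Stand-in for "read one chunk from disk":
    chunk = np.random.random((block, src_cols))
    row_out = i * 4 // src_rows
    # Collapse the block into 4 coarse cells (mean over each sub-grid):
    out[row_out] += chunk.reshape(block, 4, src_cols // 4).mean(axis=(0, 2))

print(out)  # each cell ~0.5 for uniform random input
```

Peak memory here is one block plus the canvas, which is the behavior I expected from `Canvas.raster` on a chunked array.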