I have a large dataset that I process using xarray+dask for scalability. These libraries work great for all of my calculations, except for one. The final step is to do some statistical bootstrapping (resampling along the largest dimension) and then calculate a variance over it.
The way I do it looks like this:
import numpy as np
import xarray as xr

# Draw bootstrap indices: sample_count resamples, each with sample_size draws along "n"
idx = xr.DataArray(
    np.random.randint(0, projections.n.size, (sample_count, sample_size)),
    dims=("sample", "n"),
)
# For each resample: select along "n", reduce, then stack the per-sample results
bootstrapped_variations = xr.concat(
    [projections.isel(n=i).var(dim="n").sum(dim="ReIm") for i in idx], dim="sample"
).chunk("auto")
This works for some sample sizes, but it does not scale to larger ones and I get out-of-memory errors. I guess the main problem is that so many new arrays are created by the isel calls. The thing is, they should get reduced immediately, since we only calculate their variance.
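For concreteness, here is a minimal dummy setup (the array shape, the extra mode dimension and the sizes are made-up placeholders, not my real data) showing how large each intermediate is before the reduction:

import numpy as np
import xarray as xr

# Dummy stand-in for my real data; "n" is the large bootstrap dimension
projections = xr.DataArray(
    np.random.rand(200_000, 2, 20), dims=("n", "ReIm", "mode")
)
sample_count, sample_size = 500, 200_000

i = xr.DataArray(
    np.random.randint(0, projections.n.size, sample_size), dims="n"
)
one_sample = projections.isel(n=i)  # a full (sample_size, 2, 20) copy is allocated here
print(one_sample.nbytes / 1e9, "GB before var()/sum() reduce it")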
What I would like to do is create a numpy view of such a single sample, so as not to allocate huge new arrays in memory. The problem is that I use advanced indexing here, so a copy is returned.
It can be slower computationally, but I just wonder whether it can be done. Perhaps there are other memory-efficient ways to do bootstrapping on large datasets? From what I've read, np.random.choice also returns a copy, not a view.
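For reference, this is the view-vs-copy behaviour I mean, in plain numpy (nothing xarray-specific):

import numpy as np

a = np.arange(1_000_000)
idx = np.random.randint(0, a.size, 100)

# Basic slicing gives a view that shares memory with the original array ...
print(np.shares_memory(a, a[10:20]))                       # True
# ... but advanced (fancy) indexing always returns a copy,
print(np.shares_memory(a, a[idx]))                         # False
# and so does np.random.choice
print(np.shares_memory(a, np.random.choice(a, size=100)))  # False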