
I have a large dataset that I process using xarray + dask for scalability. These libraries work great for all of my calculations except one: the final step, where I perform statistical bootstrapping (over the largest dimension) and then calculate a variance over the bootstrap samples.

The way I do it looks like this:

import numpy as np
import xarray as xr

# Draw bootstrap indices: sample_count resamples of size sample_size,
# drawn with replacement along the "n" dimension of projections.
idx = xr.DataArray(
    np.random.randint(0, projections.n.size, (sample_count, sample_size)),
    dims=("sample", "n"),
)

# For each resample, select along "n", take the variance over "n" and the sum
# over "ReIm", then stack the per-sample results along a new "sample" dimension.
bootstrapped_variations = xr.concat(
    [projections.isel(n=i).var(dim="n").sum(dim="ReIm") for i in idx], dim="sample"
).chunk("auto")

This works for smaller sample sizes, but it does not scale to larger ones and I get out-of-memory errors. I guess the main problem is creating so many new arrays when calling isel. The thing is, they should get reduced immediately, since we calculate their variance.
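
To make concrete what I mean by "reduced immediately": the sketch below computes each sample's statistic on its own and only keeps the small reduced result around, instead of building up all the resampled arrays first (same variable names as above; this is just an illustration of the idea, not my actual code):

per_sample = []
for i in idx:  # i holds one row of bootstrap indices (dims: "n")
    stat = projections.isel(n=i).var(dim="n").sum(dim="ReIm")
    per_sample.append(stat.compute())  # force the reduction now, keep only the small result
bootstrapped_variations = xr.concat(per_sample, dim="sample")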

What I would like to do is create a NumPy view of such a single sample, so as not to allocate huge new arrays in memory. Here I use advanced indexing, so a copy is returned.

It can be computationally slower, but I just wonder whether it can be done. Perhaps there are other efficient ways to do bootstrapping on large datasets? From what I've read, np.random.choice also returns a copy, not a view.
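
For reference, this is the copy-vs-view behaviour I am referring to, in plain NumPy (unrelated to my actual data, just a small check):

import numpy as np

a = np.arange(10)
view = a[2:6]                      # basic slicing returns a view (no new allocation)
fancy = a[np.array([2, 3, 4, 5])]  # advanced (fancy) indexing returns a copy
print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, fancy))  # False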

Comment: You should consider posting this question on discourse.pangeo.io. (Apr 7, 2023)
