
So I have a 2D numpy array arr. It's a relatively big one: arr.shape = (2400, 60000)

What I'm currently doing is the following:

  • randomly (with replacement) select arr.shape[0] indices
  • access the chosen rows of arr
  • calculate column-wise means and take the max value
  • repeat this k times

It looks something like:

no_rows = arr.shape[0]
indices = np.arange(no_rows)
my_vals = []
for k in range(no_samples):
    random_idxs = np.random.choice(indices, size=no_rows, replace=True)
    my_vals.append(
        arr[random_idxs].mean(axis=0).max()
    )

My problem is that this is very slow. With my arr size, one loop iteration takes ~3 s. Since I want a sample bigger than 1k, my current solution is pretty bad (1k * ~3 s -> ~1 h). I've profiled it and the bottleneck is accessing the rows by index; mean and max are fast, and np.random.choice is also fine.
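The profiling result is consistent with how NumPy handles this kind of indexing: indexing with an integer array (fancy indexing) always materializes a copy of the selected rows, whereas basic slicing returns a view. A minimal sketch to see the difference (the array here is just a small stand-in):

```python
import numpy as np

arr = np.zeros((4, 5))
view = arr[1:3]                    # basic slice: a view, no data copied
fancy = arr[np.array([1, 2])]      # fancy indexing: a fresh copy

assert np.shares_memory(arr, view)        # the slice shares arr's buffer
assert not np.shares_memory(arr, fancy)   # the fancy-indexed result does not
```

So each iteration of the loop above copies the full 2400 x 60000 selection before averaging it.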

Do you see any room for improvement? A more efficient way of accessing the indexed rows, or better yet, a faster approach that avoids the problem altogether?

What I tried so far:

  • numpy.take (slower)
  • numpy.ravel :

something similar to:

random_idxs = np.random.choice(sample_idxs, size=sample_size, replace=True)
test = arr.ravel()[random_idxs.ravel()].reshape(arr.shape)
  • a similar approach to the current one, but without the loop: I created a 3D array and accessed the rows across the additional dimension in one go
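For reference, the loop-free variant described in the last bullet might look something like this (a sketch with a small stand-in array; arr, no_samples, etc. stand for the question's variables):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((6, 10))           # small stand-in for the real (2400, 60000) array
no_samples, no_rows = 4, arr.shape[0]

# one (no_rows,) index vector per sample, stacked into a 2-D index array
idx = rng.integers(0, no_rows, size=(no_samples, no_rows))

# arr[idx] has shape (no_samples, no_rows, n_cols): every resample at once
vals = arr[idx].mean(axis=1).max(axis=1)
```

At the question's sizes, arr[idx] would need roughly no_samples * 2400 * 60000 * 8 bytes, so this trades the loop for memory rather than removing the copying cost.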
  • What is no_samples in your code? Commented Jul 30, 2020 at 8:47
  • no_samples is just an integer defining the sample size. I want it to be >1k Commented Jul 30, 2020 at 9:05

2 Answers


Since advanced indexing generates a copy, the program allocates a huge amount of memory in arr[random_idxs].

So one of the simplest ways to improve efficiency is to do things batch-wise:

BATCH = 512
max(arr[random_idxs,i:i+BATCH].mean(axis=0).max() for i in range(0,arr.shape[1],BATCH))
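Dropped into the question's loop, the batched version might look like this (a sketch; BATCH and the stand-in array sizes are placeholders to tune):

```python
import numpy as np

rng = np.random.default_rng(42)
arr = rng.random((8, 20))           # stand-in for the real (2400, 60000) array
no_rows, no_samples, BATCH = arr.shape[0], 5, 7

my_vals = []
for _ in range(no_samples):
    random_idxs = rng.integers(0, no_rows, size=no_rows)
    # walk over the columns in chunks, so only a (no_rows, BATCH) copy
    # is ever materialized instead of a full copy of arr
    my_vals.append(max(
        arr[random_idxs, i:i + BATCH].mean(axis=0).max()
        for i in range(0, arr.shape[1], BATCH)
    ))
```

The max over per-batch column maxima equals the max over all columns, so the result is identical to the unbatched arr[random_idxs].mean(axis=0).max().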

1 Comment

Thank you - that indeed helps! I managed to speed it up ~3.5x with the batch approach. A single iteration went from ~3 s down to ~0.79 s. I'd add only that it's worth playing with the batch size. I ran several tests; here are the results as (BATCH, avg_time_over_100_iterations): (16, 1.13), (32, 0.87), (64, 0.79), (128, 1.2), (256, 1.33), (512, 1.8), (1024, 2.47)

This is not a general solution to the problem, but it should make your specific problem much faster. Basically, arr.mean(axis=0).max() won't change, so why not take random samples from that array?

Something like:

mean_max = arr.mean(axis=0).max()
my_vals = np.array([np.random.choice(mean_max, size=len(mean_max), replace=True) for i in range(no_samples)])

You may even be able to do: my_vals = np.random.choice(mean_max, size=(no_samples, len(mean_max)), replace=True), but I'm not sure how, if at all, that would change your statistics.

1 Comment

Not sure if I follow. arr.mean(axis=0).max() WILL change - that's the whole point. I'm choosing random indices with replacement, so every iteration the mean of each column should be different. Then I take the highest of those column means and append it as a data point to my final list. As for your solution - it won't run. The mean_max you proposed is a single float: the highest of the column-wise means of arr. So you're trying to randomly choose from a single available value.
