Let's say I have a NumPy array:
x = np.array([3, 9, 2, 1, 5, 4, 7, 7, 8, 6])
If I sum up this array, I get 52. What I need is a way to split it up starting from left to right into roughly n chunks where n is chosen by the user. Essentially, the splits occur in a greedy fashion. So, for some number of chunks n, the first n - 1 chunks must each sum up to at least 52/n and they must be consecutive indices taken from left to right.
So, if n = 2 then the first chunk would consist of the first 7 elements:
chunk[0] = x[:7] # [3, 9, 2, 1, 5, 4, 7], sum = 31
chunk[1] = x[7:] # [7, 8, 6], sum = 21
Notice that the first chunk wouldn't consist of the first 6 elements only since the sum would be 24 which is less than 52/2 = 26. Also, notice that the number of elements in each chunk is allowed to vary as long as the sum criteria is met. Finally, it is perfectly fine for the last chunk to not be close to 52/2 = 26 since the other chunk(s) may take more.
However, the output that I need is a two column array that contains the start index in the first column and the (exclusive) stop index in the second column:
[[0, 7],
[7, 10]]
If n = 4, then the first 3 chunks need to each sum up to at least 52/4 = 13 and would look like this:
chunk[0] = x[:3] # [3, 9, 2], sum = 14
chunk[1] = x[3:7] # [1, 5, 4], sum = 17
chunk[2] = x[7:9] # [7, 8], sum = 15
chunk[3] = x[9:] # [6], sum = 6
And the output that I need would be:
[[0, 3],
[3, 7],
[7, 9],
[9, 10]
So, one naive approach using for loops might be:
ranges = np.zeros((n_chunks, 2), np.int64)
ranges_idx = 0
range_start_idx = start
sum = 0
for i in range(x.shape[0]):
sum += x[i]
if sum > x.sum() / n_chunks:
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = min(
i + 1, x.shape[0]
) # Exclusive stop index
# Reset and Update
range_start_idx = i + 1
ranges_idx += 1
sum = 0
# Handle final range outside of for loop
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = x.shape[0]
if ranges_idx < n_chunks - 1:
left[ranges_idx:] = x.shape[0]
return ranges
I am looking for a nicer vectorized solution.
np.cumsum().