2

Let's say I have a NumPy array:

x = np.array([3, 9, 2, 1, 5, 4, 7, 7, 8, 6])

If I sum up this array, I get 52. What I need is a way to split it up starting from left to right into roughly n chunks where n is chosen by the user. Essentially, the splits occur in a greedy fashion. So, for some number of chunks n, the first n - 1 chunks must each sum up to at least 52/n and they must be consecutive indices taken from left to right.

So, if n = 2 then the first chunk would consist of the first 7 elements:

chunk[0] = x[:7]  # [3, 9, 2, 1, 5, 4, 7], sum = 31
chunk[1] = x[7:]  # [7, 8, 6], sum = 21

Notice that the first chunk wouldn't consist of the first 6 elements only since the sum would be 24 which is less than 52/2 = 26. Also, notice that the number of elements in each chunk is allowed to vary as long as the sum criteria is met. Finally, it is perfectly fine for the last chunk to not be close to 52/2 = 26 since the other chunk(s) may take more.

However, the output that I need is a two column array that contains the start index in the first column and the (exclusive) stop index in the second column:

[[0, 7],
 [7, 10]]

If n = 4, then the first 3 chunks need to each sum up to at least 52/4 = 13 and would look like this:

chunk[0] = x[:3]  # [3, 9, 2], sum = 14
chunk[1] = x[3:7]  # [1, 5, 4], sum = 17
chunk[2] = x[7:9]  # [7, 8], sum = 15
chunk[3] = x[9:]  # [6], sum = 6

And the output that I need would be:

[[0, 3],
 [3, 7],
 [7, 9],
 [9, 10]

So, one naive approach using for loops might be:


ranges = np.zeros((n_chunks, 2), np.int64)
ranges_idx = 0
range_start_idx = start

sum = 0
for i in range(x.shape[0]):
    sum += x[i]
    if sum > x.sum() / n_chunks:
        ranges[ranges_idx, 0] = range_start_idx
        ranges[ranges_idx, 1] = min(
                i + 1, x.shape[0]
            )  # Exclusive stop index
        # Reset and Update
        range_start_idx = i + 1
        ranges_idx += 1
        sum = 0
# Handle final range outside of for loop
ranges[ranges_idx, 0] = range_start_idx
ranges[ranges_idx, 1] = x.shape[0]
if ranges_idx < n_chunks - 1:
    left[ranges_idx:] = x.shape[0]

return ranges

I am looking for a nicer vectorized solution.

10
  • 1
    Where is the coding attempt to solve this? You've describe the algorithm -- first implementation attempt generally falls on the poster (i.e. you). Commented Apr 30, 2020 at 20:41
  • 1
    What happens if the list is exhausted before reaching N chunks? For instance, N=6, x=[3, 3, 3, 11, 11, 11]. You get chunks of [3, 3, 3], [11], [11], [11], and then you have no list remaining for the final two chunks. Commented Apr 30, 2020 at 20:43
  • @Prune Oops, I forgot the most important part. I've updated the question with code. In the case of exhaustion, you can end immediately. But in the real case, the list will always be long enough to fill up all of the chunks. Commented Apr 30, 2020 at 21:12
  • 1
    Ok, now it's very different question. I'm not sure what you mean by 'vectorized' in this case. Maybe use np.cumsum(). Commented Apr 30, 2020 at 21:17
  • @rpoleski "Vectorized" as in NumPy vectorized functions that allows us to avoid explicit for-loops Commented Apr 30, 2020 at 21:19

2 Answers 2

3

I found inspiration in a similar question that was answered:

def func(x, n):
    out = np.zeros((n, 2), np.int64)
    cum_arr = x.cumsum() / x.sum()
    idx = 1 + np.searchsorted(cum_arr, np.linspace(0, 1, n, endpoint=False)[1:])
    out[1:, 0] = idx  # Fill the first column with start indices
    out[:-1, 1] = idx  # Fill the second column with exclusive stop indices
    out[-1, 1] = x.shape[0]  # Handle the stop index for the final chunk
    return out

Update

To cover the pathological case, we need to be a little more precise and do something like:

def func(x, n, truncate=False):
    out = np.zeros((n_chunks, 2), np.int64)
    cum_arr = x.cumsum() / x.sum()
    idx = 1 + np.searchsorted(cum_arr, np.linspace(0, 1, n, endpoint=False)[1:])
    out[1:, 0] = idx  # Fill the first column with start indices
    out[:-1, 1] = idx  # Fill the second column with exclusive stop indices
    out[-1, 1] = x.shape[0]  # Handle the stop index for the final chunk

    # Handle pathological case
    diff_idx = np.diff(idx)
    if np.any(diff_idx == 0):
        row_truncation_idx = np.argmin(diff_idx) + 2
        out[row_truncation_idx:, 0] = x.shape[0]
        out[row_truncation_idx-1:, 1] = x.shape[0]
        if truncate:
            out = out[:row_truncation_idx]

    return out
Sign up to request clarification or add additional context in comments.

2 Comments

Note that for pathological case pointed out by @Prune, this code returns 2 rows more than my code.
Fair point. I've updated my answer to, hopefully, cover the pathological case
1

Here is a solution that doesn't iterate over all elements:

def fun2(array, n):
    min_sum = np.sum(array) / n
    cumsum = np.cumsum(array)
    i = -1
    count = min_sum
    out = []
    while i < len(array)-1:
        j = np.searchsorted(cumsum, count) 
        out.append([i+1, j+1])
        i = j 
        if i < len(array):
            count = cumsum[i] + min_sum
    out[-1][1] -= 1
    return np.array(out)

For the two test cases it produces the results you expected. HTH

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.