
Assume we have a numpy array A with shape (N,), a matrix D with shape (M, 3) that holds data, and another matrix I with shape (M, 3) that holds the index in A of each corresponding element of D. How can we construct A from D and I such that data elements with repeated indices are added together?

Example:

############# A[I] := D ###################################  
A = [0.5, 0.6]                         # Final Reduced Data Vector
D = [[0.1, 0.1, 0.2], [0.2, 0.4, 0.1]] # Data
I = [[0, 1, 0], [0, 1, 1]]             # Indices

For example:

A[0] = D[0][0] + D[0][2] + D[1][0]     # 0.5 = 0.1 + 0.2 + 0.2

since in the index matrix we have:

I[0][0] = I[0][2] = I[1][0] = 0

The target is to avoid looping over all elements, so that the construction stays efficient for large N and M (10^6-10^9).
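For reference, the straightforward double loop the question wants to avoid would look like this (a sketch; the helper name make_A_naive is mine):

import numpy as np

def make_A_naive(D, I, N):
    # Scatter-add every D[i][j] into A at position I[i][j]
    A = np.zeros(N)
    for i in range(len(D)):
        for j in range(len(D[i])):
            A[I[i][j]] += D[i][j]
    return A

print(make_A_naive([[0.1, 0.1, 0.2], [0.2, 0.4, 0.1]],
                   [[0, 1, 0], [0, 1, 1]], 2))  # [0.5 0.6]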

  • Question needs more clarification. What do you mean by "repeated element indexes are added"? Are you getting 0.5 by doing 0.1 + 0.4? If so, how are you getting 0.6? Commented Dec 2, 2020 at 6:36
  • Iterate over I and add A[I[x][y]] += D[x][y]. Commented Dec 2, 2020 at 7:49
  • Are all indices guaranteed to appear at least once in I? Commented Dec 2, 2020 at 7:56
  • On second thought, it doesn't matter. Commented Dec 2, 2020 at 8:27
  • M does not matter though - what MIGHT matter is the number of integers in I. Commented Dec 6, 2020 at 21:13

3 Answers


I doubt you can get much faster than np.bincount; notice how the official documentation covers this exact use case:

# Your example
A = [0.5, 0.6]
D = [[0.1, 0.1, 0.2], [0.2, 0.4, 0.1]]
I = [[0, 1, 0], [0, 1, 1]]

# Solution
import numpy as np
D, I = np.array(D).flatten(), np.array(I).flatten()
print(np.bincount(I, D))  # [0.5 0.6]

1 Comment

On a second look, using ravel instead of flatten might give us a slight boost in performance here, but I don't think that's very relevant relative to the actual binning and counting phase.
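For completeness, the ravel variant mentioned in the comment would look like this (same result; ravel returns a view instead of a copy when the input is already a contiguous array):

import numpy as np

D = np.array([[0.1, 0.1, 0.2], [0.2, 0.4, 0.1]])
I = np.array([[0, 1, 0], [0, 1, 1]])
print(np.bincount(I.ravel(), weights=D.ravel()))  # [0.5 0.6]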

The shapes of I and D don't matter: you can clearly ravel the arrays without changing the outcome:

index = np.ravel(I)
data = np.ravel(D)

Now you can sort both arrays according to I:

sorter = np.argsort(index)
index = index[sorter]
data = data[sorter]

This is helpful because now index looks like this:

0, 0, 0, 1, 1, 1

And data is this:

0.1, 0.2, 0.2, 0.1, 0.4, 0.1

Adding together runs of consecutive numbers should be easier than processing random locations. Let's start by finding the indices where the runs start:

runs = np.r_[0, np.flatnonzero(np.diff(index)) + 1]

Now you can use the fact that ufuncs like np.add have a partial reduce operation called reduceat. This allows you to sum regions of an array:

a = np.add.reduceat(data, runs)
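With the question's example, the intermediates from the two steps above come out as follows (a quick trace, not part of the original answer):

import numpy as np

index = np.array([0, 0, 0, 1, 1, 1])                 # after ravel + sort
data = np.array([0.1, 0.2, 0.2, 0.1, 0.4, 0.1])
runs = np.r_[0, np.flatnonzero(np.diff(index)) + 1]
print(runs)                         # [0 3]: one run per distinct index
print(np.add.reduceat(data, runs))  # [0.5 0.6]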

If I is guaranteed to contain all indices in [0, A.size) at least once, you're done: just assign to A instead of a. If not, you can make the mapping using the fact that the start of each run in index is the target index:

A = np.zeros(M)
A[index[runs]] = a
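A minimal sketch of that mapping, with hypothetical data where M = 3 and index 1 never appears in I:

import numpy as np

M = 3
index = np.array([0, 0, 2, 2])          # already sorted; index 1 is missing
data = np.array([0.1, 0.2, 0.3, 0.4])
runs = np.r_[0, np.flatnonzero(np.diff(index)) + 1]   # [0 2]
a = np.add.reduceat(data, runs)                       # [0.3 0.7]
A = np.zeros(M)
A[index[runs]] = a                      # run starts give the target indices
print(A)                                # [0.3 0.  0.7]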

Algorithmic complexity analysis (writing MN for the total number of elements in D and I, and M for the size of A, as in the function below):

  • ravel is O(1) in time and space if the data is in an array. If it's a list, this is O(MN) in time and space
  • argsort is O(MN log MN) in time and O(MN) in space
  • Indexing by sorter is O(MN) in time and space
  • Computing runs is O(MN) in time and O(MN + M) = O(MN) in space
  • reduceat is a single pass: O(MN) in time, O(M) in space
  • Reassigning A is O(M) in time and space

Total: O(MN log MN) time, O(MN) space

TL;DR

import numpy as np

def make_A(D, I, M):
    index = np.ravel(I)
    data = np.ravel(D)
    sorter = np.argsort(index)
    index = index[sorter]

    if index[0] < 0 or index[-1] >= M:
        raise ValueError('Bad indices')

    data = data[sorter]
    runs = np.r_[0, np.flatnonzero(np.diff(index)) + 1]
    a = np.add.reduceat(data, runs)
    if a.size == M:
        return a
    A = np.zeros(M)
    A[index[runs]] = a
    return A
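As a quick check, calling make_A on the question's example (all indices appear, so the early return path is taken):

D = [[0.1, 0.1, 0.2], [0.2, 0.4, 0.1]]
I = [[0, 1, 0], [0, 1, 1]]
print(make_A(D, I, 2))  # [0.5 0.6]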

3 Comments

Thank you! This is a very smart way to do it. I will just wait a few days in case someone finds a more efficient algorithm.
Take your time. You could theoretically do this faster using a hash table, since that has O(1) lookup time, but I suspect your data isn't large enough for the savings in complexity to overtake the overhead of Python function calls.
@Roy. Made a couple of typo fixes. Enjoy

If you know the size of A beforehand, as it seems you do, you can simply use np.add.at:

import numpy as np

D = [[0.1, 0.1, 0.2], [0.2, 0.4, 0.1]]
I = [[0, 1, 0], [0, 1, 1]]

arr_D = np.array(D)
arr_I = np.array(I)

A = np.zeros(2)

np.add.at(A, arr_I, arr_D)

print(A)

Output

[0.5 0.6]

If you don't know the size of A, you can use max to compute it:

A = np.zeros(arr_I.max() + 1)
np.add.at(A, arr_I, arr_D)
print(A)

Output

[0.5 0.6]

The time complexity of this algorithm is O(N), and the space complexity is O(N) as well.

The:

arr_I.max() + 1

is what bincount does under the hood. From the documentation:

The result of binning the input array. The length of out is equal to np.amax(x)+1.

That being said, bincount is at least one order of magnitude faster:

I = np.random.choice(1000, size=(1000, 3), replace=True)
D = np.random.random((1000, 3))
%timeit make_A_with_at(I, D, 1000)
213 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit make_A_with_bincount(I, D)
11 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
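The two helpers in the timing above are not defined in the answer; presumably they were along these lines (a reconstruction from this answer and the accepted one, so treat the exact bodies as an assumption):

import numpy as np

def make_A_with_at(I, D, M):
    # np.add.at scatter-add, as in this answer
    A = np.zeros(M)
    np.add.at(A, I, D)
    return A

def make_A_with_bincount(I, D):
    # np.bincount with weights, as in the accepted answer
    return np.bincount(np.ravel(I), weights=np.ravel(D))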

1 Comment

Thanks for the detailed explanation and timing.
