
The aim here is speed: I am trying to avoid looping through the arrays in question. It can, however, be assumed that both arrays are sorted.

import numpy as np

a = np.arange(10)
b = np.array([2.3, 3.5, 5.8, 13])
c = somefunc(a, b)

Now somefunc should find the indices of a whose values are closest to the values in b, i.e.

In []: c
Out[]: array([2, 3 or 4, 6, 9])  # 3 or 4, depending on how the 3.5 tie is broken

Once again, this could be done with a loop, but I am looking for something much faster. I got quite close with an absolute-difference approach, something like:

np.argmin(np.abs(a[:, np.newaxis] - b), axis=0)

But even this is a little slow, since many unnecessary subtractions are performed.

  • I think you probably want c to be array([2, 3, 6, 9]), because you're comparing with arange(10), which starts from 0. Commented Jun 13, 2017 at 16:33
  • What do you mean when you say your argmin result doesn't give a single index per b value? It does for me... Commented Jun 13, 2017 at 16:34
  • Shouldn't that be [2,4,6,9] instead? Commented Jun 13, 2017 at 17:31
  • Sorry, yes, it should be [2, 3 or 4, 6, 9], depending on whether it is Python 2 or 3. Commented Jun 14, 2017 at 7:16

3 Answers


So, using the suggestion from @Eelco to use searchsorted, I came to the following, which is quicker on a larger dataset than the np.argmin broadcast method:

def finder(a, b):
    dup = np.searchsorted(a, b)      # insertion points of b into a
    uni = np.unique(dup)
    uni = uni[uni < a.shape[0]]      # drop insertion points past the end of a
    ret_b = np.zeros(uni.shape[0], dtype=int)
    for idx, val in enumerate(uni):
        mask = dup == val            # b values sharing this insertion point
        bw = np.argmin(np.abs(a[val] - b[mask]))
        ret_b[idx] = np.where(mask)[0][bw]
    return np.column_stack((uni, ret_b))



Keep track of two pointers, one for the current index into a and one into b. As we increment the a pointer, track the minimum difference (and the index where it occurs) between the elements being pointed to, until the a element exceeds the b element; then update the minimum difference and index one last time (if they change). That gives the answer for the first element of b. Continue the search similarly by advancing the b pointer, picking the a pointer up from where we left off.

Complexity: O(len(a) + len(b)), i.e. linear.
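The two-pointer scan described above can be sketched roughly as follows (this is my own sketch, not the answerer's code; the name closest_two_pointer is hypothetical, and both arrays are assumed sorted ascending):

```python
import numpy as np

def closest_two_pointer(a, b):
    # Hypothetical sketch of the two-pointer idea: for each b[j], advance
    # the a pointer while the next element of a is at least as close.
    # The a pointer never moves backwards, hence O(len(a) + len(b)).
    out = np.empty(len(b), dtype=int)
    i = 0
    for j, bv in enumerate(b):
        while i + 1 < len(a) and abs(a[i + 1] - bv) <= abs(a[i] - bv):
            i += 1
        out[j] = i
    return out
```

Note that with `<=` in the advance condition, ties such as 3.5 resolve to the higher index (4 rather than 3).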

1 Comment

If len(a) is sufficiently large relative to len(b), this can instead be done with len(b) binary searches into a, which improves efficiency.

scipy.spatial.cKDTree is the simplest solution to this problem: it is vectorized and more than likely good enough for your application, though not theoretically optimal given that your data is sorted.
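For the example from the question, that would look something like this (a sketch; note cKDTree expects 2-D data, so the 1-D arrays are reshaped to column vectors):

```python
import numpy as np
from scipy.spatial import cKDTree

a = np.arange(10, dtype=float)
b = np.array([2.3, 3.5, 5.8, 13.0])

# One row per point, one column per dimension.
tree = cKDTree(a[:, None])
dist, idx = tree.query(b[:, None])  # nearest neighbour in a for each b
```

Here idx holds the nearest indices; for the tied value 3.5, either 3 or 4 may be returned.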

Alternatively, you can use numpy.searchsorted: find the left- or right-sided insertion point, then compare that point with its neighbour and keep whichever is closer.
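A minimal sketch of that insertion-point comparison, assuming a is sorted ascending (the name closest_indices is my own):

```python
import numpy as np

def closest_indices(a, b):
    # For each element of b, return the index of the nearest element of a.
    idx = np.searchsorted(a, b)        # insertion points into sorted a
    idx = np.clip(idx, 1, len(a) - 1)  # keep both neighbours in range
    left, right = a[idx - 1], a[idx]
    # Step back one index wherever the left neighbour is strictly closer;
    # exact ties therefore resolve to the right neighbour.
    idx -= (b - left) < (right - b)
    return idx
```

With the arrays from the question, closest_indices(a, b) resolves the 3.5 tie to index 4 and clips the out-of-range 13 to index 9.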

