
I have two NumPy arrays x and y containing float values. For each value in x, I want to find the closest element in y, without reusing elements of y. The output should be a one-to-one mapping from indices of x to indices of y. Here's a bad way to do it that relies on sorting: it removes each element from the candidate list once it has been paired. Without sorting this would be bad because the pairing would depend on the order of the original input arrays.

import operator
import numpy as np

def min_i(values):
    min_index, min_value = min(enumerate(values),
                               key=operator.itemgetter(1))
    return min_index, min_value

# unsorted elements
unsorted_x = np.random.randn(10) * 10
unsorted_y = np.random.randn(10) * 10

# sort lists
x = np.sort(unsorted_x)
y = np.sort(unsorted_y)

pairs = []
indx_to_search = list(range(len(y)))  # a list, so we can remove() from it

for x_indx, x_item in enumerate(x):
    if len(indx_to_search) == 0:
        print("ran out of items to match...")
        break
    # look for the closest remaining item (smallest absolute difference)
    possible_values = y[indx_to_search]
    nearest_indx, _ = min_i(np.abs(possible_values - x_item))
    orig_indx = indx_to_search[nearest_indx]
    # remove it
    indx_to_search.remove(orig_indx)
    pairs.append((x_indx, orig_indx))

print("paired items: ")
for k, v in pairs:
    print(x[k], " paired with ", y[v])

I'd prefer to do it without sorting the elements first, but if they are sorted then I want to get the indices into the original, unsorted lists unsorted_x, unsorted_y. What's the best way to do this in numpy/scipy/Python, or using pandas? Thanks.

edit: to clarify, I'm not trying to find the best fit across all elements (not minimizing the sum of distances, for example) but rather the best fit for each element, even if that sometimes comes at the expense of other elements. I assume that y is generally much larger than x, contrary to the above example, so there are usually many very good fits for each value of x in y, and I just want to find one of them efficiently.

Can someone show an example of scipy's KD-trees for this? The docs are quite sparse.

kdtree = scipy.spatial.cKDTree([x,y])
kdtree.query([-3]*10) # ?? unsure about what query takes as arg
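For reference, query expects an array of query points with the same dimensionality as the data the tree was built on, so 1-D data needs a trailing axis on both sides. A minimal sketch with my own toy data (not from the thread):

```python
import numpy as np
from scipy.spatial import cKDTree

y = np.array([0.0, 1.5, 3.0, 4.5])
tree = cKDTree(y[:, None])  # data must be 2-D: (n_points, n_dims)

# query points must also be 2-D; k=1 asks for the single nearest neighbor
dist, idx = tree.query(np.array([[1.4], [4.0]]), k=1)
print(idx)   # index into y of the nearest element, for each query point
print(dist)  # the corresponding absolute distances
```

With k=1 both returned arrays are 1-D, one entry per query point.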
  • I think a sort with a binary search to find the index is probably your best bet. Commented Mar 12, 2013 at 14:09
  • @mgilton: are there built in binary search algos in scipy/numpy? Commented Mar 12, 2013 at 14:10
  • Yep: numpy.searchsorted Commented Mar 12, 2013 at 14:11
  • Seems to me like you'll need a combo of np.sort, np.argsort and np.searchsorted to accomplish it. Commented Mar 12, 2013 at 14:13
  • @Jaime, not sure what you mean; you can get the k-nearest neighbours for points outside the set you query with it. tree = KDTree(x[:,None]); tree.query(y[:,None], k=1) finds the nearest x for all y (based on the quadratic norm, you can change that). Commented Mar 12, 2013 at 15:25
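The one-liner from the comment above, spelled out as a runnable snippet (sample data is my own; note this plain k=1 query allows duplicates, unlike what the question asks for):

```python
import numpy as np
from scipy.spatial import cKDTree

x = np.random.randn(10) * 10
y = np.random.randn(10) * 10

tree = cKDTree(x[:, None])               # build the tree on x
dist, idx = tree.query(y[:, None], k=1)  # for each y: index of nearest x
```

idx[j] is the index into x of the element closest to y[j]; the same x index can appear more than once.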

1 Answer


EDIT 2 A solution using KDTree can perform very well if you can choose a number of neighbors that guarantees that you will have a unique neighbor for every item in your array. With the following code:

import numpy as np
import scipy.spatial

def nearest_neighbors_kd_tree(x, y, k):
    x, y = map(np.asarray, (x, y))
    tree = scipy.spatial.cKDTree(y[:, None])
    ordered_neighbors = tree.query(x[:, None], k)[1]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    nearest_neighbor.fill(-1)
    used_y = set()
    for j, neigh_j in enumerate(ordered_neighbors):
        for idx in neigh_j:  # renamed from k, which shadowed the parameter
            if idx not in used_y:
                nearest_neighbor[j] = idx
                used_y.add(idx)
                break
    return nearest_neighbor
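The timings below assume a sample like the one defined further down in this answer:

```python
import numpy as np

n = 1000
x = np.random.rand(n)      # 1000 query values
y = np.random.rand(2 * n)  # 2000 candidates to pair against
```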

and a sample of n=1000 points, I get:

In [9]: np.any(nearest_neighbors_kd_tree(x, y, 12) == -1)
Out[9]: True

In [10]: np.any(nearest_neighbors_kd_tree(x, y, 13) == -1)
Out[10]: False

So the optimum is k=13, and then the timing is:

In [11]: %timeit nearest_neighbors_kd_tree(x, y, 13)
100 loops, best of 3: 9.26 ms per loop

But in the worst case, you could need k=1000, and then:

In [12]: %timeit nearest_neighbors_kd_tree(x, y, 1000)
1 loops, best of 3: 424 ms per loop

Which is slower than the other options:

In [13]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 60 ms per loop

In [14]: %timeit nearest_neighbors_sorted(x, y)
10 loops, best of 3: 47.4 ms per loop

EDIT Sorting the array before searching pays off for arrays of more than 1000 items:

def nearest_neighbors_sorted(x, y):
    x, y = map(np.asarray, (x, y))
    y_idx = np.argsort(y)
    y = y[y_idx]
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.searchsorted(y, xj)
        # step back if the left neighbor is closer (or if we ran off the end)
        if idx == len(y) or (idx != 0 and y[idx] - xj > xj - y[idx - 1]):
            idx -= 1
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor
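The searchsorted branch in that function chooses between the two candidate neighbors around the insertion point. In isolation, on toy data of my own, the trick looks like this:

```python
import numpy as np

y = np.array([0.0, 2.0, 5.0, 9.0])  # must be sorted
xj = 3.0

idx = np.searchsorted(y, xj)  # first position where xj could be inserted: 2
# y[idx] is the first element >= xj; y[idx-1] is the last element < xj.
# Step back if the left neighbor is closer, or if idx fell off the end.
if idx == len(y) or (idx != 0 and y[idx] - xj > xj - y[idx - 1]):
    idx -= 1

print(idx, y[idx])  # 1 2.0  (2.0 is closer to 3.0 than 5.0 is)
```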

With a 10000 element long array:

In [2]: %timeit nearest_neighbors_sorted(x, y)
1 loops, best of 3: 557 ms per loop

In [3]: %timeit nearest_neighbors(x, y)
1 loops, best of 3: 1.53 s per loop

For smaller arrays it performs slightly worse.


You are going to have to loop over all your items to implement your greedy nearest neighbor algorithm, if only to discard duplicates. With that in mind, this is the fastest I have been able to come up with:

def nearest_neighbors(x, y):
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

And now with:

n = 1000
x = np.random.rand(n)
y = np.random.rand(2*n)

I get:

In [11]: %timeit nearest_neighbors(x, y)
10 loops, best of 3: 52.4 ms per loop
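As a quick sanity check (my own addition, repeating the function above verbatim so the snippet is self-contained), the greedy pairing really is one-to-one:

```python
import numpy as np

# Greedy nearest-neighbor pairing from the answer above.
def nearest_neighbors(x, y):
    x, y = map(np.asarray, (x, y))
    y = y.copy()
    y_idx = np.arange(len(y))
    nearest_neighbor = np.empty((len(x),), dtype=np.intp)
    for j, xj in enumerate(x):
        idx = np.argmin(np.abs(y - xj))
        nearest_neighbor[j] = y_idx[idx]
        y = np.delete(y, idx)
        y_idx = np.delete(y_idx, idx)
    return nearest_neighbor

x = np.random.rand(100)
y = np.random.rand(200)
pairing = nearest_neighbors(x, y)
assert len(set(pairing)) == len(pairing)  # no y index is reused
```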

6 Comments

thanks. Is there a way to do it without duplicates using cKDTree? Even with slight performance hit?
Another question: is there a way to ensure that np.argmin(np.abs(y - xj)) will ignore missing values like NaN? Is there ever a case where it will pick those?
np.nanargmin is what you want.
@jaime all 3 return errors in 2022: nearest_neighbors_kd_tree raises, in ckdtree.pyx (scipy.spatial.ckdtree.cKDTree.__init__()), ValueError: data must be 2 dimensions; nearest_neighbors(x, y) raises ValueError: operands could not be broadcast together with shapes (127,) (2,); nearest_neighbors_sorted(x, y) raises ValueError: object too deep for desired array
OK, removing the "[:, None]"s from the KDTree solution runs without errors. However, there's no way to avoid duplicates without running the whole thing. On the full dataset, I hit memory issues MemoryError: Unable to allocate 155. GiB for an array with shape (144008, 144008) and data type float64
