
I would like to know the most efficient way to create a lookup table for floats (and collections of floats) in Python. Since both sets and dicts need their keys to be hashable, I guess I can't use any notion of closeness to check for proximity to already-inserted keys, can I? I have seen this answer, but it's not quite what I'm looking for: I don't want to push the burden of creating the right key onto the user, and I also need to extend it to collections of floats. For example, given the following code:

>>> import numpy as np
>>> a = {np.array([0.01, 0.005]): 1}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'numpy.ndarray'
>>> a = {tuple(np.array([0.01, 0.005])): 1}
>>> tuple(np.array([0.0100000000000001,0.0050002])) in a
False

I would like the last statement to return True. Coming from the C++ world, I would create a std::map and provide a comparison function that uses some user-defined tolerance to check whether a value has already been added to the data structure. Of course, this question extends naturally to lookup tables of arrays (for example, numpy arrays). So what's the most efficient way to accomplish what I'm looking for?
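For illustration, the closest workaround I've come up with is quantizing the coordinates before using them as dict keys (just a sketch; make_key is a name I made up, and decimals=4 is an arbitrary rounding choice, not a true distance tolerance). It handles my example above, but it breaks for pairs of points that straddle a rounding boundary:

```python
import numpy as np

def make_key(point, decimals=4):
    # Quantize each coordinate so that nearby floats map to the same key.
    # decimals=4 is an arbitrary choice here, not a real distance tolerance.
    return tuple(np.round(point, decimals))

lut = {make_key(np.array([0.01, 0.005])): 1}
print(make_key(np.array([0.0100000000000001, 0.0050002])) in lut)  # True
# ...but two values closer than any rounding step can still get different keys:
print(make_key(np.array([0.00004999])) == make_key(np.array([0.00005001])))  # False
```

So rounding alone doesn't give a genuine tolerance-based lookup, which is why I'm asking.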

  • Are you trying to make a mapping from float to list of float? Commented Aug 19, 2015 at 13:05
  • 1
    Can't you just provide a custom lookup function (def look_up(dictionary, float_key)) to the user that implements the solution you linked to? Commented Aug 19, 2015 at 13:07
  • I don't want to put any burden on the user. The idea is to have thousands of numpy arrays that represent 3D points so that the creation of new points is avoided if the point is already in the dictionary. But I need to check with respect to some tolerance value (snapping). Commented Aug 19, 2015 at 13:13
  • See this answer: stackoverflow.com/a/3387975/3888455 Commented Aug 20, 2015 at 5:10
  • @Sid I see that answer, but what are you suggesting? The data structure would still need to hash keys, and there's going to have to be some rounding involved, right? Commented Aug 20, 2015 at 8:25

1 Answer


Since you are interested in 3D points, you could consider a data structure that is optimized for storing spatial data, such as a k-d tree. This is available in SciPy and allows looking up the point closest to a given coordinate. After you have found this point, you can check whether it is within some tolerance to decide whether to accept the new point.

Usage should be something like this (untested, never used it myself):

import numpy as np
from scipy.spatial import KDTree

points = ...  # points is an (N, 3) array
tree = KDTree(points)
new_point = ...  # array of length 3
distance, nearest_index = tree.query(new_point)
if distance > tolerance:  # no existing point is close enough, so add it
    points = np.vstack((points, new_point))
    tree = KDTree(points)  # the tree has to be rebuilt from scratch

Note that a k-d tree is efficient for looking up a point in a static collection of points (the cost of a lookup is O(log N)), but it is not optimized for repeatedly adding new points. The SciPy implementation even lacks a method to add new points, so you have to generate a new tree every time you insert one. Since this operation is probably O(N log N), it might be faster to just do a brute-force calculation of all distances, which costs O(N). Note also that there is an alternative version, cKDTree, which might be implemented in C for speed; the documentation is not really clear on this.
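To sketch that brute-force alternative (also untested in production; add_if_new and the tolerance value below are just illustrative), you would keep the accepted points in a growing array and compute all distances on every insert:

```python
import numpy as np

def add_if_new(points, new_point, tolerance):
    # points is an (N, 3) array (N may be 0); returns the possibly-extended array.
    # Each call costs O(N) distance computations, with no tree to rebuild.
    if len(points):
        distances = np.linalg.norm(points - new_point, axis=1)
        if distances.min() <= tolerance:
            return points  # a close-enough point already exists: snap to it
    return np.vstack((points, new_point))

points = np.empty((0, 3))
points = add_if_new(points, np.array([0.01, 0.005, 0.0]), 1e-3)
points = add_if_new(points, np.array([0.0100001, 0.0050002, 0.0]), 1e-3)  # snapped
points = add_if_new(points, np.array([1.0, 2.0, 3.0]), 1e-3)
print(len(points))  # 2
```

For a few thousand points, the vectorized O(N) scan per insert may well beat rebuilding a tree on every insertion.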


4 Comments

Could you please develop on this? I like the idea.
So this is not quite what I need, because the idea is to add new points to the tree only if they're not already present in it, and the tree has to start out empty (no points).
That is easy to achieve. The first point should be accepted by default, so you do tree = KDTree([first_point]). For all subsequent points, you use the code I showed and only add them when they are not close to existing points.
It's not efficient (as I put in the title of the post). It doesn't make sense to recreate the entire data structure every time I insert a new point, does it? There has to be a better way. The search time is O(log N), but a std::map has the same complexity because it's implemented as a red-black tree. I would be surprised if you couldn't do the same thing in Python.
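Putting the answer's suggestion together as a whole (a sketch only, untested; insert_point is a name made up for illustration), the rebuild-per-insert version discussed above would look like this:

```python
import numpy as np
from scipy.spatial import KDTree

def insert_point(points, new_point, tolerance):
    # Accept the first point unconditionally; afterwards query the nearest
    # neighbour and only insert when it is farther away than tolerance.
    if not points:
        return [new_point]
    distance, _ = KDTree(points).query(new_point)
    if distance > tolerance:
        points.append(new_point)
    return points

points = []
for p in [np.array([0.0, 0.0, 0.0]),
          np.array([1e-7, 0.0, 0.0]),   # within tolerance: dropped
          np.array([1.0, 0.0, 0.0])]:
    points = insert_point(points, p, tolerance=1e-3)
print(len(points))  # 2
```

As the comments note, rebuilding the tree on every accepted point is the weak spot; it only pays off if inserts are rare relative to queries.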
