
I have a list of points in a NumPy matrix,

A = [[x11,x12,x13],[x21,x22,x23] ]

and an origin point o = [o1,o2,o3], and I have to compute the distance from o to every point.

A - o subtracts o from every point. Currently I square every component and add them up in a for loop. Is there a more idiomatic way to do this?

P.S.: I am doing this calculation as part of a k-means clustering application. I have computed the centroids and now have to compute the distance from every point to each centroid.

input_mat = input_data_per_minute.values[:,2:5]

scaled_input_mat = scale2(input_mat)

k_means = cluster.KMeans(n_clusters=5)

print('training start')
k_means.fit(scaled_input_mat)
print('training over')

out = k_means.cluster_centers_

I have to compute the distance between input_mat and each cluster centroid.

  • Look into cdist from scipy. Commented Mar 12, 2016 at 8:12

2 Answers

3

Numpy solution:

NumPy's broadcasting lets you compute all the distances in one step. It can consume a lot of memory, though, depending on the number of points and cluster centers: it creates an intermediate array of shape (number_of_points, number_of_cluster_centers, 3).

First you need to know a bit about broadcasting. I'll play it safe and define each dimension by hand.

I'll start by defining some points and centers for illustration purposes:

import numpy as np

points = np.array([[1,1,1],
                   [2,1,1],
                   [1,2,1],
                   [5,5,5]])

centers = np.array([[1.5, 1.5, 1],
                    [5,5,5]])

Now I'll prepare these arrays so that I can use numpy broadcasting to get the distance in each dimension:

distance_3d = points[:,None,:] - centers[None,:,:]

Effectively, the first dimension is now the point "label", the second dimension is the center "label", and the third dimension is the coordinate. The subtraction gives the signed difference along each coordinate. The result has shape:

(number_of_points, number_of_cluster_centers, 3)
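A quick shape check can make the broadcasting step concrete. This is a minimal sketch reusing the points and centers arrays from above; the names points_b and centers_b are only illustrative.

```python
import numpy as np

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])
centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Insert a length-1 axis so the two arrays line up for broadcasting.
points_b = points[:, None, :]    # shape (4, 1, 3)
centers_b = centers[None, :, :]  # shape (1, 2, 3)

# The length-1 axes are stretched to match: (4, 1, 3) - (1, 2, 3) -> (4, 2, 3)
distance_3d = points_b - centers_b

print(points_b.shape, centers_b.shape, distance_3d.shape)
```

Writing None in an index is the same as np.newaxis; either spelling adds the extra axis without copying data.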

Now it's only a matter of applying the formula for the Euclidean distance:

# Square each distance
distance_3d_squared = distance_3d ** 2

# Sum the squared differences over the coordinates (the result will be 2D)
distance_sum = np.sum(distance_3d_squared, axis=2)

# And take the square root
distance = np.sqrt(distance_sum)

For my test data the final result is:

#array([[ 0.70710678,  6.92820323],
#       [ 0.70710678,  6.40312424],
#       [ 0.70710678,  6.40312424],
#       [ 6.36396103,  0.        ]])

So distance[i, j] is the distance from point i to center j.
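Since the question comes from a k-means setting, one natural follow-up is to pick the closest center for each point. A minimal sketch on the same test data; nearest_center and nearest_distance are names made up for this illustration.

```python
import numpy as np

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])
centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Pairwise Euclidean distances, shape (number_of_points, number_of_centers)
distance = np.sqrt(np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2))

# Index of the closest center for every point (minimum along the centers axis)
nearest_center = np.argmin(distance, axis=1)

# Distance from every point to its closest center
nearest_distance = distance.min(axis=1)

print(nearest_center)  # [0 0 0 1]
```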

Summary:

You can put all of this in one line:

distance2 = np.sqrt(np.sum((points[:,None,:] - centers[None,:,:]) ** 2, axis=2))
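If the intermediate 3D array is too large for memory, one possible workaround is to process the points in chunks. The following is only a sketch under that assumption; the function name chunked_distances and the chunk_size parameter are hypothetical and not part of NumPy.

```python
import numpy as np

def chunked_distances(points, centers, chunk_size=1000):
    """Pairwise Euclidean distances, computed chunk_size points at a time."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    out = np.empty((points.shape[0], centers.shape[0]))
    for start in range(0, points.shape[0], chunk_size):
        stop = start + chunk_size
        # The intermediate array is at most (chunk_size, number_of_centers, 3)
        diff = points[start:stop, None, :] - centers[None, :, :]
        out[start:stop] = np.sqrt(np.sum(diff ** 2, axis=2))
    return out

# Usage with random data (purely illustrative)
rng = np.random.default_rng(0)
pts = rng.random((10, 3))
ctr = rng.random((4, 3))
result = chunked_distances(pts, ctr, chunk_size=3)
```

The result has the same shape and values as the one-liner; only the peak memory of the intermediate step changes.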

Scipy solution (faster & shorter):

Or, if you have SciPy, use cdist:

from scipy.spatial.distance import cdist
distance3 = cdist(points, centers)

Both approaches give the same result (up to floating-point rounding), but cdist is typically the fastest for large numbers of points and centers.
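To check that the two approaches agree on the test data, the results can be compared with np.allclose. A minimal sketch; broadcast_result and cdist_result are illustrative names.

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[1, 1, 1],
                   [2, 1, 1],
                   [1, 2, 1],
                   [5, 5, 5]])
centers = np.array([[1.5, 1.5, 1],
                    [5, 5, 5]])

# Broadcasting version
broadcast_result = np.sqrt(np.sum((points[:, None, :] - centers[None, :, :]) ** 2, axis=2))

# cdist version (the Euclidean metric is the default)
cdist_result = cdist(points, centers)

# Compare with a tolerance rather than ==, since both are floating-point results
print(np.allclose(broadcast_result, cdist_result))
```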


0

You should be able to do something like this (assuming I read your question right ;) ):

In [1]: import numpy as np

In [2]: a = np.array([[11,12,13],[21,22,23]])

In [3]: o = [1,2,3]

In [4]: a - o  # just showing
Out[4]: 
array([[10, 10, 10],
       [20, 20, 20]])

In [5]: a ** 2  # just showing
Out[5]: 
array([[121, 144, 169],
       [441, 484, 529]])

In [6]: b = (a ** 2) + (a - o)

In [7]: b
Out[7]: 
array([[131, 154, 179],
       [461, 504, 549]])

NumPy is great because it operates on arrays element-wise! This means that 90+% of the time you can process the array without a for-loop. Using a Python for-loop over the array is also significantly slower.
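For the Euclidean distance itself, the differences a - o would be squared and summed along each row before taking the square root. A minimal sketch reusing a and o from the session above; dist and dist_norm are illustrative names.

```python
import numpy as np

a = np.array([[11, 12, 13], [21, 22, 23]])
o = np.array([1, 2, 3])

# Square the per-coordinate differences, sum across each row, take the root
dist = np.sqrt(np.sum((a - o) ** 2, axis=1))

# Equivalent shortcut: the vector norm of each row of differences
dist_norm = np.linalg.norm(a - o, axis=1)

print(dist)
```

Both expressions give one distance per point, here sqrt(300) and sqrt(1200), with no explicit loop.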
