
This is for a K-Means algorithm. It is homework, so I do not want to use a built-in K-Means function. I have two NumPy arrays: one of centroids and one of data points. I am trying to find the distance from each centroid to each data point, but I don't know how to pass the arrays to my function so that it prints the right values. I want to end up with as many arrays of distances as there are centroids. Then I can compare the distances for each point, pick the smallest, and assign that point to the corresponding cluster. After that I take the mean of each cluster, and those means become my new centroids.

import numpy as np

centroids = np.array([[3,44],[5,15]])
dataPoints = np.array([[2,4],[17,4],[45,2],[45,7],[16,32],[32,14],[20,56],[68,33]])
def distance(a,b):
    for x in a: #for each point in centroids array
        for y in b:#for each point in the dataPoints array
            print np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)#print the distance

distance(centroids, dataPoints)#call the function with the data

The output I am getting:

[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]
[ 12.04159458  41.48493703]

What am I doing that is obviously wrong here? I should end up with 2 different arrays with 8 distances each.

  • You keep referring to the same first element of a and b. Use x[i] and y[i] instead of a[i] and b[i]. Commented Mar 7, 2017 at 23:46
  • I'm not sure why you expect 2 containers of 8 things. Look at your loops. You're looping over each of the 2 elements in centroids, and within that over each of the 8 elements in dataPoints. You print every time, so you're going to print 16 things. Commented Mar 7, 2017 at 23:48
  • By the way, your inner loop can be replaced with print(np.sqrt(((dataPoints-x)**2).sum(axis=1))). Commented Mar 7, 2017 at 23:51
  • @DYZ Your first comment worked like a charm. Now I need to re-evaluate in order to get my different arrays for comparison. The second comment you left gave me an error: 'operands could not be broadcast together with shapes (8,2) and (2,2)'. Commented Mar 7, 2017 at 23:54
  • @Denziloe my apologies, it was definitely obvious why I'm not getting the 2 arrays I want here. I am going to try to figure that out now. Any suggestions? Commented Mar 7, 2017 at 23:55
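
The two suggestions in the comments above can be combined into a short sketch. It assumes the goal is one array of 8 distances per centroid, as the question states. The helper name distances_per_centroid is made up for illustration. The broadcast error quoted above is presumably what NumPy raises when the whole (2,2) centroids array is subtracted from the (8,2) dataPoints array, rather than a single centroid row of shape (2,).

```python
import numpy as np

centroids = np.array([[3, 44], [5, 15]])
dataPoints = np.array([[2, 4], [17, 4], [45, 2], [45, 7],
                       [16, 32], [32, 14], [20, 56], [68, 33]])

def distances_per_centroid(centroids, points):
    """Return a list with one distance array per centroid (hypothetical helper)."""
    result = []
    for c in centroids:
        # points has shape (8, 2) and c has shape (2,), so c is broadcast
        # across every row; summing over axis=1 leaves 8 squared distances.
        result.append(np.sqrt(((points - c) ** 2).sum(axis=1)))
    return result

dists = distances_per_centroid(centroids, dataPoints)
for d in dists:
    print(d)  # one array of 8 distances per centroid
```

Collecting the arrays in a list instead of printing inside the inner loop is what turns 16 printed values into 2 arrays of 8.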

2 Answers


I got tired of writing new variants of distance calculations for 1D, 2D and 3D arrays, so I put together a function that emulates pdist and cdist from scipy but uses einsum, which many people on this site use. It is easy to follow, in my mind at least, and einsum is versatile for other purposes. So consider the following. You can then use sorting (sort, argsort, etc.) if you need to extract the closest-x values. Hope you find it useful.

a = np.array([[1, 2], [3, 4], [5, 6]])
b = np.array([[6, 5], [4, 3], [2, 1]])

def e_dist(a, b, metric='euclidean'):
    """Distance calculation for 1D, 2D and 3D points using einsum
    : a, b   - list, tuple, array in 1,2 or 3D form
    : metric - euclidean ('e','eu'...), sqeuclidean ('s','sq'...)
    :-----------------------------------------------------------------------
    """
    a = np.asarray(a)
    b = np.atleast_2d(b)
    a_dim = a.ndim
    b_dim = b.ndim
    if a_dim == 1:
        a = a.reshape(1, 1, a.shape[0])
    if a_dim >= 2:
        a = a.reshape(np.prod(a.shape[:-1]), 1, a.shape[-1])
    if b_dim > 2:
        b = b.reshape(np.prod(b.shape[:-1]), b.shape[-1])
    diff = a - b
    dist_arr = np.einsum('ijk,ijk->ij', diff, diff)
    if metric[:1] == 'e':
        dist_arr = np.sqrt(dist_arr)
    dist_arr = np.squeeze(dist_arr)
    return dist_arr

e_dist(a, b)
array([[ 5.8,  3.2,  1.4],
       [ 3.2,  1.4,  3.2],
       [ 1.4,  3.2,  5.8]])

e_dist(a[0], b)
array([ 5.8,  3.2,  1.4])

e_dist(a[:2], b)
array([[ 5.8,  3.2,  1.4],
       [ 3.2,  1.4,  3.2]])
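
To illustrate the sorting step mentioned above, here is a minimal sketch applied to the arrays from the question. It does not call e_dist. Instead it builds the same (centroids x points) distance matrix with plain broadcasting and np.linalg.norm so that the block runs on its own. The variable names labels and nearest3 are assumptions made for this example.

```python
import numpy as np

centroids = np.array([[3, 44], [5, 15]])
dataPoints = np.array([[2, 4], [17, 4], [45, 2], [45, 7],
                       [16, 32], [32, 14], [20, 56], [68, 33]])

# Shape (2, 1, 2) minus shape (1, 8, 2) broadcasts to (2, 8, 2);
# taking the norm over the last axis gives a (2, 8) distance matrix.
dist = np.linalg.norm(centroids[:, None, :] - dataPoints[None, :, :], axis=2)

# For each point (column), the row index of the closest centroid.
labels = dist.argmin(axis=0)
print(labels)    # [1 1 1 1 0 1 0 1]

# For each centroid (row), the indices of its 3 closest points.
nearest3 = np.argsort(dist, axis=1)[:, :3]
print(nearest3)
```

argmin along axis 0 answers "which centroid is closest to this point", while argsort along axis 1 answers "which points are closest to this centroid".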

import numpy as np

centroids = np.array([[3,44],[5,15]])
dataPoints = np.array([[2,4],[17,4],[45,2],[45,7],[16,32],[32,14],[20,56],[68,33]])

def size(vector):
    return np.sqrt(sum(x**2 for x in vector))

def distance(vector1, vector2):
    return size(vector1 - vector2)

def distances(array1, array2):
    return [[distance(vector1, vector2) for vector2 in array2] for vector1 in array1]

print(distances(centroids, dataPoints))
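
Once the distances are available, the remaining step from the question (assign each point to its nearest centroid, then average each cluster) could look roughly like the sketch below. The function name update_centroids is hypothetical, and keeping the old centroid when a cluster ends up empty is an assumption of this sketch. The question does not say how that case should be handled.

```python
import numpy as np

def update_centroids(centroids, points):
    """One assignment + update step of K-Means (illustrative sketch)."""
    # (k, n) matrix of distances from every centroid to every point.
    dist = np.linalg.norm(centroids[:, None, :] - points[None, :, :], axis=2)
    # Index of the nearest centroid for each point.
    labels = dist.argmin(axis=0)
    # Float copy so that means are not truncated to integers.
    new_centroids = centroids.astype(float)
    for k in range(len(centroids)):
        members = points[labels == k]
        if len(members) > 0:  # assumption: an empty cluster keeps its old centroid
            new_centroids[k] = members.mean(axis=0)
    return new_centroids, labels

centroids = np.array([[3, 44], [5, 15]])
dataPoints = np.array([[2, 4], [17, 4], [45, 2], [45, 7],
                       [16, 32], [32, 14], [20, 56], [68, 33]])

new_centroids, labels = update_centroids(centroids, dataPoints)
print(labels)
print(new_centroids)
```

Calling update_centroids repeatedly with the returned centroids, until the labels stop changing, gives the usual K-Means loop.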
