3

I'm trying to implement a k-nearest neighbour classifier in Python, and so I want to calculate the Euclidean distance. I have a dataset that I have converted into a big numpy array

[[  0.   0.   4. ...,   1.   0.   1.]
 [  0.   0.   5. ...,   0.   0.   1.]
 [  0.   0.  14. ...,  16.   9.   1.]
 ..., 
 [  0.   0.   3. ...,   2.   0.   3.]
 [  0.   1.   7. ...,   0.   0.   3.]
 [  0.   2.  10. ...,   0.   0.   3.]]

where the last element of each row indicates the class. So when calculating the Euclidean distance, I obviously don't want to include the last element. I thought I could do the following

for row in dataset:
    distance = euclidean_distance(vector, row[:dataset.shape[1] - 1])

but that still includes the last element

print row
>>> [[  0.   0.   4. ...,   1.   0.   1.]]
print row[:dataset.shape[1] - 1]
>>> [[  0.   0.   4. ...,   1.   0.   1.]]

as you can see both are the same.

1 Answer 1

6

You can subset the data using numpy slicing. If you find yourself iterating over a numpy array, stop and try to find a method that takes advantage of the vectorized nature of numpy operations.

Assuming your array is called arr:

data_points = arr[:,:-1]
classes = arr[:,-1]

For distance to vector calculations:

To find the distance between a 1d array and all of the rows of a 2d array, you can use to following. It assumes the 1d array is v and the 2d array is arr.

dist = np.power(arr - v, 2).sum(axis=1)

dist will be a 1d array of distances.


For pairwise calculations:

The following function takes a 2d array of numbers and returns the upper-diagonal matrix of pair-wise distances using the given L-x distance measurement (the Euclidean distance measure is the L=2 metric).

def pairwise_distance(arr, L=2):
    d = arr.shape[0]
    out = np.zeros(d)
    for f in range(1, d):
        out[:-f].ravel()[f::d+1] = np.power(arr[:-f]-arr[f:], L).sum(axis=1)
    return np.power(out, 1.0/L)
Sign up to request clarification or add additional context in comments.

4 Comments

My idea was to create a Counter with the index of each row as a key and the Euclidean distance as a value, and then get the k indices with the lowest distance and return these. How could I do this without iterating over the dataset / is there a better way than doing this with a Counter?
I added a method for calculating the pairwise distances. Please remember to mark the question as answered.
I didn't ask that though, I asked how I could calculate this for each row in a numpy array without iterating over it since you said you shouldn't iterate over a numpy array. P.S. you can simply use np.linalg.norm(a - b) where a and b are numpy arrays to calculate the Euclidean distance.
Ok, I reread your question. Yup I totally missed what you were asking. I think I have answered it now.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.