First n elements of row in numpy array

Question

I'm trying to implement a k-nearest neighbour classifier in Python, and so I want to calculate the Euclidean distance. I have a dataset that I have converted into a big numpy array

[[  0.   0.   4. ...,   1.   0.   1.]
 [  0.   0.   5. ...,   0.   0.   1.]
 [  0.   0.  14. ...,  16.   9.   1.]
 ..., 
 [  0.   0.   3. ...,   2.   0.   3.]
 [  0.   1.   7. ...,   0.   0.   3.]
 [  0.   2.  10. ...,   0.   0.   3.]]

where the last element of each row indicates the class. So when calculating the Euclidean distance, I obviously don't want to include the last element. I thought I could do the following

for row in dataset:
    distance = euclidean_distance(vector, row[:dataset.shape[1] - 1])

but that still includes the last element

print row
>>> [[  0.   0.   4. ...,   1.   0.   1.]]
print row[:dataset.shape[1] - 1]
>>> [[  0.   0.   4. ...,   1.   0.   1.]]

as you can see both are the same.

James · Accepted Answer · 2016-12-01 18:21:48Z

6

You can subset the data using numpy slicing. If you find yourself iterating over a numpy array, stop and try to find a method that takes advantage of the vectorized nature of numpy operations.

Assuming your array is called arr:

data_points = arr[:,:-1]
classes = arr[:,-1]

For distance to vector calculations:

To find the distance between a 1d array and all of the rows of a 2d array, you can use to following. It assumes the 1d array is v and the 2d array is arr.

dist = np.power(arr - v, 2).sum(axis=1)

dist will be a 1d array of distances.

For pairwise calculations:

The following function takes a 2d array of numbers and returns the upper-diagonal matrix of pair-wise distances using the given L-x distance measurement (the Euclidean distance measure is the L=2 metric).

def pairwise_distance(arr, L=2):
    d = arr.shape[0]
    out = np.zeros(d)
    for f in range(1, d):
        out[:-f].ravel()[f::d+1] = np.power(arr[:-f]-arr[f:], L).sum(axis=1)
    return np.power(out, 1.0/L)

edited Dec 1, 2016 at 18:21

answered Dec 1, 2016 at 13:52

James

37k4 gold badges54 silver badges79 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

user5368737 Over a year ago

My idea was to create a Counter with the index of each row as a key and the Euclidean distance as a value, and then get the k indices with the lowest distance and return these. How could I do this without iterating over the dataset / is there a better way than doing this with a Counter?

James Over a year ago

I added a method for calculating the pairwise distances. Please remember to mark the question as answered.

user5368737 Over a year ago

I didn't ask that though, I asked how I could calculate this for each row in a numpy array without iterating over it since you said you shouldn't iterate over a numpy array. P.S. you can simply use np.linalg.norm(a - b) where a and b are numpy arrays to calculate the Euclidean distance.

James Over a year ago

Ok, I reread your question. Yup I totally missed what you were asking. I think I have answered it now.

Collectives™ on Stack Overflow

First n elements of row in numpy array

1 Answer 1

For distance to vector calculations:

For pairwise calculations:

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

For distance to vector calculations:

For pairwise calculations:

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related