
I'm trying to run a K-Means Clustering algorithm with numpy and python, but keep running into a memory error if I use larger values of K (anything greater than 10 seems to result in the error). I have two numpy arrays of size [42000,784] (the dataset) and [K,784] (the centroids). The memory error occurs when calculating the euclidean distance between each of the centroids and each of the data points. This is the function I have been using:

import numpy as np

def dist(a, b):
    # broadcast to shape (K, 42000, 784) and reduce over the feature axis
    a = a[np.newaxis, :, :]
    b = b[:, np.newaxis, :]
    dist = np.linalg.norm(a - b, axis=2)
    return dist

Is this a memory leak or do I legitimately not have enough memory (I have 8GB)? How can I fix this?

  • Note, your reshaping operations can be more succinctly articulated as a[None,:,:] and b[:,None,:]; or, less arcanely, you can use np.newaxis like so: b[:,np.newaxis,:]. Commented Jan 14, 2018 at 21:57
  • So the final array a-b will contain 42000 * 784 * K elements. Supposing you use float64s (8 bytes per value), that will be about 251 * K MB. For an imaginary K = 15 that's ~3.7 GB (a quick version of this estimate in code is shown below the comments). np.linalg.norm may just use a bit of extra memory but (hopefully) not that much. The really interesting question now is: how much RAM do you have? Commented Jan 14, 2018 at 21:58
  • I have 8GB of RAM, so it should work. I'd like to be able to scale it up to something along the lines of K=256; is there any way to do this? Closing everything else running will get K=20 to work, but it takes a very long time. Checking RAM usage shows around 7.5GB used. Commented Jan 14, 2018 at 22:27
  • @juanpa.arrivillaga I fixed my reshaping, does that reduce memory usage? Commented Jan 14, 2018 at 22:39
  • @LunarLlama no, it was an aside. Commented Jan 14, 2018 at 22:49
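
To put the memory arithmetic from the comments into code, here is a minimal sketch (assuming float64 data and the array sizes from the question) that estimates the size of the broadcast intermediate a - b:

import numpy as np

# size of the (K, 42000, 784) intermediate created by broadcasting,
# assuming float64 (8 bytes per element)
n_samples, n_features = 42000, 784
itemsize = np.dtype(np.float64).itemsize
for K in (10, 15, 20, 256):
    n_bytes = K * n_samples * n_features * itemsize
    print(K, round(n_bytes / 1024**3, 1), "GiB")   # K=15 -> ~3.7 GiB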

2 Answers


scipy has built-in functions for distance computations, which are lightning fast compared to home-made implementations.

So, the first idea is to replace your whole distance function by the following expression:

from numpy.random import randn
from scipy.spatial import distance

# sample data
a = randn(42000, 784)
b = randn(256, 784)

# distance computation
dist = distance.cdist(a, b, metric='euclidean')    # about 8.02 s on
                                                   # my 8 GB RAM machine

Note that dist in this example is transposed relative to your example. If you want the shape from your example, just do dist = distance.cdist(a, b).T.

It is further possible to speed up the computation a little by omitting the square root operation. You can accomplish this with dist = distance.cdist(a, b, metric='sqeuclidean').

This whole approach does not greatly reduce memory consumption, but it holds on to the memory only for a few seconds.
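
Since the goal is K-Means, here is a minimal sketch of how the cdist call could feed the assignment step; the names data, centroids and the helper function are illustrative assumptions, not code from the original post:

import numpy as np
from scipy.spatial import distance

def assign_to_nearest_centroid(data, centroids):
    # squared distances are enough for finding the nearest centroid
    d2 = distance.cdist(data, centroids, metric='sqeuclidean')
    return d2.argmin(axis=1)                       # shape (n_samples,)

data = np.random.randn(42000, 784)
centroids = np.random.randn(256, 784)
labels = assign_to_nearest_centroid(data, centroids)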

The second idea is to not use home-made implementations at all, but a reliable third-party package, like the well-known scikit-learn:

from numpy.random import randn
from sklearn.cluster import KMeans

a = randn(4200, 200)

km = KMeans(n_clusters=256)
km.fit(a)                    # about 10 s

One of several advantages of this implementation is that it automatically decides how to compute the distances so that it does not blow up your memory.
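
After fitting, the labels and centroids are available directly on the estimator, so no full distance matrix needs to be kept around; a short follow-up sketch using the km object from above:

labels = km.labels_               # cluster index for each sample in a
centroids = km.cluster_centers_   # array of shape (n_clusters, n_features)

# assigning new points reuses the same memory-friendly code path
new_labels = km.predict(randn(1000, 200))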




As an alternative, using the numba accelerator in nopython parallel mode can be one of the fastest methods. I have compared the performance of numba and cdist on various array sizes; both consume roughly the same time (e.g. both take 8 seconds on my machine), though numba may beat cdist on smaller arrays:

import numpy as np
import numba as nb

@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def distances_numba(a, b):
    # (n_samples, n_centroids) result, filled row by row in parallel
    arr = np.zeros((a.shape[0], b.shape[0]), dtype=np.float64)
    temp_arr = np.zeros_like(arr)

    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            for k in range(a.shape[1]):
                temp_arr[i, j] += (a[i, k] - b[j, k]) ** 2
            arr[i, j] = temp_arr[i, j] ** 0.5

    return arr
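
A usage sketch, under the assumption that the inputs are C-contiguous float64 arrays (which is what the explicit numba signature above requires); np.random.randn already produces that layout:

import numpy as np

a = np.random.randn(42000, 784)   # dataset, C-contiguous float64
b = np.random.randn(256, 784)     # centroids, C-contiguous float64

d = distances_numba(a, b)         # distance matrix of shape (42000, 256)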

I did not compare memory consumption, but I think numba will be one of the best in this regard.

Update

We can parallelize max9111's suggestion (see the comments below), which was 10-20 times faster than cdist or my first solution. The parallelization makes max9111's solution around 1.5 to 2 times faster again. The benchmarks are based on some of my tests and need more evaluation.

@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def first(A):
    TMP_A = np.zeros(A.shape[0], dtype=np.float64)
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            TMP_A[i] += A[i, j] ** 2
    return TMP_A


@nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def second(B):
    TMP_B = np.zeros(B.shape[0], dtype=np.float64)
    for i in nb.prange(B.shape[0]):
        for j in range(B.shape[1]):
            TMP_B[i] += B[i, j] ** 2
    return TMP_B


@nb.njit("float64[:, ::1](float64[:, ::1], float64[:, ::1])", parallel=True)
def calc_dist_p(A, B):
    dist = np.dot(A, B.T)
    TMP_A = first(A)
    TMP_B = second(B)
    for i in nb.prange(A.shape[0]):
        for j in range(B.shape[0]):
            dist[i, j] = (-2. * dist[i, j] + TMP_A[i] + TMP_B[j]) ** 0.5
    return dist
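
As a quick sanity check (an illustrative sketch, not part of the original answer), the parallel version can be compared against cdist on smaller random arrays; some deviation is expected because the dot-product formulation is less precise:

import numpy as np
from scipy.spatial import distance

a = np.random.randn(1000, 784)
b = np.random.randn(32, 784)

d_numba = calc_dist_p(a, b)
d_scipy = distance.cdist(a, b, metric='euclidean')
print(np.abs(d_numba - d_scipy).max())   # small, but not exactly zero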

2 Comments

  • You can beat cdist by more than a factor of 50 if you change the algorithm to use the relation (a-b)^2 = a^2 + b^2 - 2ab (less precise): stackoverflow.com/a/53380192/4045774 The idea is to do most of the calculation in a highly optimized BLAS routine (a plain-numpy sketch of this identity is shown below the comments). But this only pays off if a.shape[1] is relatively large (10-20 and upwards).
  • @max9111, your answer was much better than my first solution and cdist (+1). Based on your solution, I proposed a second solution that parallelizes your code. Now it is faster, and it might be made a little faster still with some tricks, but not by much.
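
For reference, a plain-numpy sketch of the identity mentioned in the first comment (illustrative code, not taken from the linked answer); the matrix product lets BLAS do most of the work:

import numpy as np

def euclidean_via_dot(a, b):
    # ||a_i - b_j||^2 = ||a_i||^2 + ||b_j||^2 - 2 * a_i . b_j
    a2 = np.einsum('ij,ij->i', a, a)           # squared row norms of a
    b2 = np.einsum('ij,ij->i', b, b)           # squared row norms of b
    d2 = a2[:, None] + b2[None, :] - 2.0 * (a @ b.T)
    np.maximum(d2, 0, out=d2)                  # clip tiny negatives from round-off
    return np.sqrt(d2)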
