I have seen this question on StackOverflow, and I do want to solve the problem of clustering binary images such as this one from the same abovementioned question using sparse matrix representation:

I know that there are more simple and efficient methods for clustering (KMeans, Mean-shift, etc...), and I'm aware that this problem can be solved with other solutions such as connected components,
but my aim is to use sparse matrix representation approach.
What I have tried so far is:
- Reading the image and defining distance function (Euclidean distance as an example):
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import numpy as np
from math import sqrt
from scipy.sparse import csr_matrix
original = cv2.imread("km.png", 0)
img = original.copy()
def distance(p1, p2):
"""
Euclidean Distance
"""
return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
- constructing the Sparse Matrix depending on the nonzero points of the binary image, I'm considering two clusters for simplicity, and later it can be extended to
kclusters:
data = np.nonzero(img)
data = tuple(zip(data[0],data[1]))
data = np.asarray(data).reshape(-1,2)
# print(data.shape)
l = data.shape[0] # 68245
# num clusters
k = 2
cov_mat = csr_matrix((l, l, k), dtype = np.int8).toarray()
- First Loop to get the centroids as the most far points from each other:
# inefficient loop!
for i in range(l):
for j in range(l):
if(i!=j):
cov_mat[i, j, 0] = distance(data[i], data[j])
# Centroids
# TODO: check if more than two datapoints with same max distance then take first!
ci = cov_mat[...,0].argmax()
- Calculating the distance from each centroid:
# inefficient loop!
# TODO: repeat for k clusters!
for i in range(l):
for j in range(l):
if(i!=j):
cov_mat[i, j, 0] = distance(data[i], data[ci[0]])
cov_mat[i, j, 1] = distance(data[i], data[ci[1]])
- Clustering depending on min distance:
# Labeling
cov_mat[cov_mat[...,0]>cov_mat[...,1]] = +1
cov_mat[cov_mat[...,0]<cov_mat[...,1]] = -1
# Labeling Centroids
cov_mat[ci[0], ci[0]] = +1
cov_mat[ci[1], ci[1]] = -1
- obtaining the indices of the clusters:
# clusters Indicies
cl1 = cov_mat[...,0].argmax()
cl2 = cov_mat[...,0].argmin()
# TODO: pass back the indices to the image to cluster it.
This approach is time costly, can you please tell me how to increase the efficiency please? thanks in advance.

np.linalg.norm2*68245**2 = 9_314_760_050items which should take ~70 GiB in RAM. This is really HUGE, hence slow to create/fill/read. Do you have at least 128 GiB of RAM on your machine?numpyarray directly!cov_matmatrix with the 9_314_760_050 items in your current code as a reference (ie. the "inefficient loop"). The fact that this matrix is a sparse matrix or not does not change the fact that it is huge in RAM (unless the distance are mostly 0 which is very unlikely). Sparse matrices are only good to use when most of values are 0. This is apparently not the case in the provided code, so they are less efficient (both in RAM space and speed).