
I have seen this question on StackOverflow, and I want to solve the problem of clustering binary images, such as the one below from that same question, using a sparse matrix representation:

[bw: the binary image from the linked question]

I know that there are simpler and more efficient methods for clustering (k-means, mean-shift, etc.), and I'm aware that this problem can be solved with other approaches such as connected components, but my aim is to use a sparse matrix representation.

What I have tried so far is:

  1. Reading the image and defining a distance function (Euclidean distance as an example):
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import cv2
import numpy as np
from math import sqrt
from scipy.sparse import csr_matrix

original = cv2.imread("km.png", 0)

img = original.copy()

def distance(p1, p2):
    """
    Euclidean Distance
    """
    return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
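
For later reference, the scalar function above has a vectorised counterpart (the np.linalg.norm hint from the comments below). A minimal sketch, assuming pts is an (n, 2) array of points and q is a single point; both names are illustrative:

def distances_to_point(pts, q):
    """Euclidean distance from every row of `pts` to the single point `q`."""
    # Broadcasting subtracts q from every row; the norm reduces the last axis.
    return np.linalg.norm(pts - q, axis=1)  # shape (n,)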
  2. Constructing the sparse matrix from the nonzero points of the binary image. I'm considering two clusters for simplicity; later it can be extended to k clusters:
data = np.nonzero(img)
data = tuple(zip(data[0],data[1]))
data = np.asarray(data).reshape(-1,2)
# print(data.shape)

l = data.shape[0] # 68245

# num clusters
k = 2
# NOTE: scipy.sparse matrices are strictly 2-D, so csr_matrix((l, l, k)) is not
# valid; since .toarray() densifies it anyway, a dense array is used here.
# int8 would truncate the real-valued distances, hence float32.
cov_mat = np.zeros((l, l, k), dtype=np.float32)
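
A side note on the line above: scipy.sparse matrices are strictly 2-D, and sparsity only pays off when most entries are zero. A minimal sketch of a genuinely sparse 2-D distance matrix, with a hypothetical cutoff below which distances are kept:

from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform

def sparse_distance_matrix(pts, cutoff):
    """Keep only pairwise distances below `cutoff`; the rest stay 0."""
    d = squareform(pdist(pts))  # dense (n, n) pairwise distances (memory-heavy!)
    d[d >= cutoff] = 0          # drop the pairs we do not want to store
    return csr_matrix(d)        # zeros are not stored in the CSR format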
  3. First loop, to pick the centroids as the pair of points farthest from each other:
# inefficient loop!
for i in range(l):
    for j in range(l):
        if(i!=j):
            cov_mat[i, j, 0] = distance(data[i], data[j])
# Centroids
# TODO: check if more than two datapoints share the same max distance, then take the first!
# argmax returns a flat index; unravel it into the (i, j) pair of the two farthest points.
ci = np.unravel_index(cov_mat[..., 0].argmax(), (l, l))
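
For comparison, the farthest pair can be found without the double loop; a sketch using scipy.spatial.distance.pdist (this still materialises all O(l**2) distances, so memory remains the bottleneck):

from scipy.spatial.distance import pdist, squareform

dists = squareform(pdist(data))                     # (l, l) pairwise distances
ci = np.unravel_index(dists.argmax(), dists.shape)  # (i, j) of the farthest pair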
  4. Calculating the distance of every point from each centroid:
# inefficient loop! (the assigned values do not depend on j, so the inner loop
# recomputes the same distances l times)
# TODO: repeat for k clusters!
for i in range(l):
    for j in range(l):
        if(i!=j):
            cov_mat[i, j, 0] = distance(data[i], data[ci[0]])
            cov_mat[i, j, 1] = distance(data[i], data[ci[1]])
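
Vectorised, this step needs only an (l, k) array instead of (l, l, k); a sketch using scipy.spatial.distance.cdist, with ci being the centroid pair found in the previous step:

from scipy.spatial.distance import cdist

centroids = data[list(ci)]                  # list() forces row selection, not a single (i, j) lookup
dist_to_centroids = cdist(data, centroids)  # (l, 2): distance of each point to each centroid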
  5. Clustering depending on the minimum distance:
# Labeling
cov_mat[cov_mat[...,0]>cov_mat[...,1]] = +1
cov_mat[cov_mat[...,0]<cov_mat[...,1]] = -1
# Labeling Centroids
cov_mat[ci[0], ci[0]] = +1
cov_mat[ci[1], ci[1]] = -1
  6. Obtaining the indices of the clusters:
# cluster indices
cl1 = cov_mat[...,0].argmax()
cl2 = cov_mat[...,0].argmin()

# TODO: pass back the indices to the image to cluster it.
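
Continuing the sketch, the labelling and the final TODO (mapping labels back onto the image) reduce to an argmin plus fancy indexing:

labels = dist_to_centroids.argmin(axis=1)       # 0 or 1: index of the nearest centroid

clustered = np.zeros(img.shape, dtype=np.uint8)
clustered[data[:, 0], data[:, 1]] = labels + 1  # the two clusters marked 1 and 2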

This approach is very time-costly. Can you please tell me how to increase its efficiency? Thanks in advance.

  • Yes, don't write your own loops. Use functions (primarily NumPy) that operate on the entire dataset at once, not on individual points. Every time you have a loop that runs over every element of a sequence/list/array, think long and hard about how to get rid of it. First hint: np.linalg.norm. Commented Jan 15, 2022 at 11:54
  • Why do you want to use a sparse matrix in the first place? It will make your loop insanely slow and prevent any vectorization with NumPy. Since SciPy is not very optimized either, computing the sparse matrix with SciPy functions will not be much better (though probably still better than loops). Working with sparse matrices in Python is a mess in terms of performance. Native languages like C/C++ can do it far more efficiently. Commented Jan 15, 2022 at 13:47
  • Moreover, fully building the covariance matrix is the actual problem here: it generates a matrix of 2*68245**2 = 9_314_760_050 items, which should take ~70 GiB of RAM. This is really HUGE, hence slow to create/fill/read. Do you have at least 128 GiB of RAM on your machine? Commented Jan 15, 2022 at 13:57
  • @JérômeRichard "Why do you want to use a sparse matrix in the first place?" Because they can be executed on GPUs, according to this article, which uses a similar approach with sparse matrices to do binary segmentation. Regarding the size of the RAM allocation, I don't want to convert this sparse matrix into a dense one, and I believe that is the core concept of sparsity; otherwise, we would use a NumPy array directly! Commented Jan 16, 2022 at 20:10
  • Ok. For the RAM usage, I just considered as a reference the loop that fills the cov_mat matrix with the 9_314_760_050 items in your current code (i.e. the "inefficient loop"). Whether that matrix is sparse or not does not change the fact that it is huge in RAM (unless the distances are mostly 0, which is very unlikely). Sparse matrices are only good to use when most of the values are 0. That is apparently not the case in the provided code, so they are less efficient (in both RAM space and speed). Commented Jan 16, 2022 at 20:49
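    (For the arithmetic behind the ~70 GiB figure: 9_314_760_050 items at 8 bytes per float64 is ≈ 69.4 GiB; with float32 it would be ≈ 34.7 GiB, and with the int8 dtype from the posted code ≈ 8.7 GiB.)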

1 Answer


As pointed out in the comments, when working with NumPy arrays, vectorised code should be preferred over scalar code (i.e., code with explicit for loops). Besides, as I see it, for this particular problem a dense NumPy array is a better choice than a SciPy sparse matrix. If you are not constrained to use a sparse representation, this code would do:

import numpy as np
from skimage.io import imread
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

n_clusters = 2
palette = [[255, 0, 0], [0, 255, 0]]  # one RGB colour per cluster

img = imread('https://i.sstatic.net/v1NDe.png')

# Coordinates of the foreground (nonzero) pixels, as an (n, 2) array.
rows, cols = img.nonzero()
coords = np.stack([rows, cols], axis=1)

# Cluster the pixel coordinates themselves.
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(coords)

# Paint each cluster with its own colour in an RGB output image.
result = np.zeros(shape=img.shape + (3,), dtype=np.uint8)
for label in range(n_clusters):
    idx = (labels == label)
    result[rows[idx], cols[idx]] = palette[label]

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 8))
ax0.imshow(img, cmap='gray')
ax0.set_axis_off()
ax0.set_title('Binary image')
ax1.imshow(result)
ax1.set_axis_off()
ax1.set_title('Cluster labels');
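
Note that KMeans here clusters the (row, col) coordinates of the foreground pixels directly, so spatially separated blobs end up in different clusters and no l-by-l pairwise distance matrix is ever built.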

Results

[Figure: the binary input image on the left, the red/green cluster labels on the right]
