
I am new to programming and hoping somebody can help me with a specific problem I have.

I want to form clusters in a 100x100 binary numpy ndarray under two conditions:

  1. I want to specify how many pixels have value zero and how many have value one.
  2. I want to have an input variable that allows me to form larger or smaller clusters.

With the answers on this page I made an ndarray out of 300 zeros and 700 ones:

import numpy as np

N = 1000
K = 300

arr = [0] * K + [1] * (N - K)
np.random.shuffle(arr)
# note: np.resize repeats the 1000-element pattern to fill all 10,000 cells
arr1 = np.resize(arr, (100, 100))

I then would like to implement a clustering algorithm that allows me to specify some measure of cluster density or cluster size.

I looked into the scipy.ndimage package but can't seem to find anything useful.

EDIT: To make my question clearer: previously I was using the package nlmpy, which uses numpy to make arrays representing virtual landscapes.

It does this by generating a random array with continuous values in [0, 1] and applying '4-neighbourhood' clustering to a subset of the pixels. After the clustering, it uses an interpolation function to assign the remaining pixels to one of the clusters.
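As far as I can tell, the pipeline is roughly as follows; this is a minimal sketch I wrote with scipy, not nlmpy's actual code, and the nearest-neighbour fill in step 4 is my own reconstruction:

import numpy as np
from scipy import ndimage
from scipy.interpolate import griddata

nRow, nCol = 100, 100
rng = np.random.default_rng(0)

# 1. randomly keep ~60% of the pixels
keep = rng.random((nRow, nCol)) < 0.60

# 2. label connected regions of the kept pixels (default = 4-neighbourhood)
labels, n_clusters = ndimage.label(keep)

# 3. give every cluster its own random value in [0, 1]
values = rng.random(n_clusters + 1)
arr = values[labels]

# 4. assign each discarded pixel to the nearest cluster by interpolation
kept_coords = np.argwhere(keep)
all_coords = np.indices((nRow, nCol)).reshape(2, -1).T
arr = griddata(kept_coords, arr[keep], all_coords, method='nearest').reshape(nRow, nCol)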

For example, making clusters with 60% of the pixels:

import nlmpy

nRow = 100
nCol = 100
arr = nlmpy.randomClusterNN(nRow, nCol, 0.60, n='4-neighbourhood', mask=None)

This gives clusters with values in [0, 1]:

[image: the clustered array]

I then use a built-in nlmpy function to reclassify this output into a binary ndarray; for example, 50% of the pixels should have value '1' and 50% value '0'.

arrBinair = nlmpy.classifyArray(arr, [0.50, 0.50])

Output:

[image: the binary clustered array]

The problem here is that not exactly 50% of the values are '1' or '0'.

print((arrBinair == 1).sum())
output: 3023.0

This is because nlmpy.randomClusterNN first builds the clusters on continuous values; the binary reclassification is applied only afterwards, per cluster, so the class proportions come out approximate rather than exact.
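A workaround I can think of (just a sketch on my side, not an nlmpy feature) is to threshold the continuous clustered array at exactly its k-th smallest value, so the zero/one counts come out exact; ties on cluster boundaries then get split arbitrarily:

import numpy as np

# force exactly k zeros in the continuous clustered array `arr` from above
k = 5000                                  # 50% of the 100x100 = 10,000 pixels
flat = arr.ravel()
order = np.argsort(flat, kind='stable')   # stable sort: ties broken by position
arrBinair = np.ones(flat.shape)
arrBinair[order[:k]] = 0                  # the k smallest values become 0
arrBinair = arrBinair.reshape(arr.shape)
print((arrBinair == 0).sum())             # exactly 5000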

My question: can such a binary clustered landscape be generated in a faster way, without first clustering into continuous classes and without using the nlmpy package?

I hope this is enough information, or do I need to post the functions 'under the hood' of the nlmpy package? I hesitate because it is quite a lot of code.

Many thanks.

1 Answer

You can more-or-less get what you want using sklearn.cluster.DBSCAN:

from matplotlib import pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

def randones(shape, n, dtype=None):
    # array of zeros with exactly n ones at random positions
    arr = np.zeros(shape, dtype=dtype)
    arr.flat[np.random.choice(arr.size, size=n, replace=False)] = 1
    return arr

def cluster(arr, *args, **kwargs):
    # cluster the coordinates of the nonzero pixels
    data = np.array(arr.nonzero()).T
    c = DBSCAN(*args, **kwargs)
    c.fit(data)
    return data, c

# generate random data
shape = (100, 100)
n = 300
arr = randones(shape, n)

# perform clustering
data, c = cluster(arr, eps=6, min_samples=4)

# plot the clusters in different colors (noise points in black)
colors = [('C%d' % (i % 10)) if i > -1 else 'k' for i in c.labels_]
fig = plt.figure(figsize=(8, 8))
ax = fig.gca()
ax.scatter(*data.T, c=colors)
plt.show()

Output:

[scatter plot of the identified clusters, each cluster in its own color, noise points in black]

The minimum number of points in a cluster is set by the min_samples parameter. You can adjust the density of the identified clusters by tuning the eps parameter, which sets the maximum distance between two points for them to be considered neighbors. For example, you can identify larger, less dense clusters by increasing eps:

# perform clustering
data, c = cluster(arr, eps=8, min_samples=4)

If we plot this less-dense clustering in the same way as before, it gives:

[scatter plot of the larger, less dense clusters found with eps=8]
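If you also want the sizes as numbers rather than a picture, the fitted estimator's labels_ attribute holds one label per point (-1 marks noise), so cluster sizes are a bincount away:

import numpy as np

# points per identified cluster, ignoring noise (label -1)
sizes = np.bincount(c.labels_[c.labels_ >= 0])
print(sizes)                    # size of each cluster
print((c.labels_ == -1).sum())  # number of unclustered (noise) points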
