I am using DBSCAN to cluster latitude and longitude points. I am using the example code given at http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html.

The code is as follows and takes in my file with over 4000 lat and lon points. However, I want to adjust this code so that it only defines a cluster as points within, say, 0.000020 of each other, as I want my clusters to be almost at street level.

At the moment I am getting 11 clusters, whereas I want at least 100. I have tried adjusting various parameters, but to no avail.

print(__doc__)

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


##############################################################################
# Generate sample data
input = np.genfromtxt(open("dataset_import_noaddress.csv","rb"),delimiter=",", skip_header=1)
coordinates = np.delete(input, [0,1], 1)

X, labels_true = make_blobs(n_samples=4000, centers=coordinates, cluster_std=0.0000005,
                        random_state=0)

X = StandardScaler().fit_transform(X)

##############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
  % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
  % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
  % metrics.silhouette_score(X, labels))

##############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
         markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
         markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

2 Answers

You appear to be changing the data generation only:

X, labels_true = make_blobs(n_samples=4000, centers=coordinates, cluster_std=0.0000005,
                    random_state=0)

instead of the clustering algorithm:

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
            ^^^^^^^ almost your complete data set?

For geographic data, make sure to use haversine distance instead of Euclidean distance. Earth is more like a sphere than a flat Euclidean world.
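A minimal sketch of what that could look like, assuming the two columns kept in `coordinates` are latitude and longitude in degrees (the question does not say which is which). The points are passed to DBSCAN directly, without `make_blobs` or `StandardScaler`, and `eps` is converted to radians because scikit-learn's haversine metric works on `[lat, lon]` pairs in radians. The helper name `cluster_latlon`, the sample points, the 20 m radius and `min_samples=3` are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres


def cluster_latlon(coords_deg, eps_m=20.0, min_samples=3):
    """Cluster [lat, lon] rows (in degrees) with DBSCAN and haversine distance."""
    coords_rad = np.radians(coords_deg)
    db = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,  # neighborhood radius expressed in radians
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    )
    return db.fit_predict(coords_rad)


# Hypothetical data: two tight groups about 1 km apart, plus one far-away point.
coords = np.array([
    [53.34980, -6.26030], [53.34981, -6.26031], [53.34982, -6.26029],
    [53.35880, -6.26030], [53.35881, -6.26031], [53.35882, -6.26029],
    [53.40000, -6.30000],
])

labels = cluster_latlon(coords)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(labels, n_clusters)
```

With the real file, `coords` would be replaced by the array loaded with `np.genfromtxt`, and `eps_m` and `min_samples` tuned to the point density.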

2 Comments

Thanks for your help. I have changed eps to 0.000010 and I am only getting 28 clusters, whereas I was expecting a lot more. Do you have any suggestions as to why this is? Thanks!
Well, study the data. Adjacent blobs will be merged by DBSCAN. Epsilon is the neighborhood radius around each point, not the maximum distance between points in a cluster.
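The merging effect can be seen on a small synthetic example. This is only a sketch with made-up data: 21 points on a line, each 0.5 away from the next, clustered with `eps=0.6` and `min_samples=2` (both values chosen for illustration). Every point has a neighbor within `eps`, so the points chain together into one cluster whose total extent is far larger than `eps`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: 21 points on a line, spaced 0.5 apart (total length 10).
X = np.column_stack([np.arange(21) * 0.5, np.zeros(21)])

# eps only needs to bridge the gap between neighboring points.
labels = DBSCAN(eps=0.6, min_samples=2).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
extent = X[:, 0].max() - X[:, 0].min()
print(n_clusters, extent)
```

The same thing happens with densely sampled streets: as long as consecutive points are closer than `eps`, they end up in one cluster regardless of how long the chain is.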
Python code for clustering an image:

from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import numpy as np

# load the image
img = plt.imread('image.jpg')

# convert the image into an array of points (one row of channel values per pixel)
points = np.reshape(img, (img.shape[0] * img.shape[1], img.shape[2]))

# define the DBSCAN algorithm with its parameters
dbscan = DBSCAN(eps=10, min_samples=10)

# run the clustering
labels = dbscan.fit_predict(points)

# convert the cluster labels back into an image
clustered_img = np.reshape(labels, (img.shape[0], img.shape[1]))

# display the image
plt.imshow(clustered_img)
plt.show()
