10

I got Memory Error when I was running dbscan algorithm of scikit. My data is about 20000*10000, it's a binary matrix.

(Maybe it's not suitable to use DBSCAN with such a matrix. I'm a beginner of machine learning. I just want to find a cluster method which don't need an initial cluster number)

Anyway I found sparse matrix and feature extraction of scikit.

http://scikit-learn.org/dev/modules/feature_extraction.html http://docs.scipy.org/doc/scipy/reference/sparse.html

But I still have no idea how to use it. In DBSCAN's specification, there is no indication about using sparse matrix. Is it not allowed?

If anyone knows how to use sparse matrix in DBSCAN, please tell me. Or you can tell me a more suitable cluster method.

1

4 Answers 4

8

The scikit implementation of DBSCAN is, unfortunately, very naive. It needs to be rewritten to take indexing (ball trees etc.) into account.

As of now, it will apparently insist of computing a complete distance matrix, which wastes a lot of memory.

May I suggest that you just reimplement DBSCAN yourself. It's fairly easy, there exists good pseudocode e.g. on Wikipedia and in the original publication. It should be just a few lines, and you can then easily take benefit of your data representation. E.g. if you already have a similarity graph in a sparse representation, it's usually fairly trivial to do a "range query" (i.e. use only the edges that satisfy your distance threshold)

Here is a issue in scikit-learn github where they talk about improving the implementation. A user reports his version using the ball-tree is 50x faster (which doesn't surprise me, I've seen similar speedups with indexes before - it will likely become more pronounced when further increasing the data set size).

Update: the DBSCAN version in scikit-learn has received substantial improvements since this answer was written.

Sign up to request clarification or add additional context in comments.

2 Comments

You might be interested in scikit-learn 0.14, where DBSCAN was indeed modified to use ball trees.
also DBSCAN now supports sparse input in v0.16
1

You can pass a distance matrix to DBSCAN, so assuming X is your sample matrix, the following should work:

from sklearn.metrics.pairwise import euclidean_distances

D = euclidean_distances(X, X)
db = DBSCAN(metric="precomputed").fit(D)

However, the matrix D will be even larger than X: n_samples² entries. With sparse matrices, k-means is probably the best option.

(DBSCAN may seem attractive because it doesn't need a pre-determined number of clusters, but it trades that for two parameters that you have to tune. It's mostly applicable in settings where the samples are points in space and you know how close you want those points to be to be in the same cluster, or when you have a black box distance metric that scikit-learn doesn't support.)

2 Comments

DBSCAN doesn't require the distance matrix, that is a limitation of the current sklearn implementation, not of the algorithm. Plus, in many cases, both the epsion and the minpts parameter of DBSCAN can be chosen much easier than k. When using geographic data for example, a user may well be able to say that a radius of "1 km" is a good epsilon, and that there should be at least 10 events within this radius.
@Anony-Mousse: I'm aware that the problems are in the implementation, not the algorithm. As for picking eps and minpts, yes, for some problems that may be easy, but for others, extensive tuning may be required. Not all problems live in Euclidean space or even on the surface of the earth.
1

Yes, since version 0.16.1. Here's a commit for a test:

https://github.com/scikit-learn/scikit-learn/commit/494b8e574337e510bcb6fd0c941e390371ef1879

Comments

0

Sklearn's DBSCAN algorithm doesn't take sparse arrays. However, KMeans and Spectral clustering do, you can try these. More on sklearns clustering methods: http://scikit-learn.org/stable/modules/clustering.html

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.