Input matrix and parameters for the DBSCAN algorithm from scikit-learn

Question

I'm new to scikit-learn and I'm trying to cluster people by their interest in movies. I create a sparse matrix with one column per movie and one row per user. Each cell is 1 if the user liked the movie and 0 otherwise.

sparse_matrix = numpy.zeros(shape=(len(list_user), len(list_movie)))
for id in list_user:
    index_id = list_user.index(id)
    for movie in list_movie[index_id]:
        if movie.isdigit():
            index_movie = list_movie.index(int(movie))
            sparse_matrix[index_id][index_movie] = 1
# pickle needs a binary file handle; the context manager also closes the file
with open("data/sparse_matrix", "wb") as f:
    pickle.dump(sparse_matrix, f)
return sparse_matrix

I treat this as an array of vectors, which according to the docs is an acceptable input:

Perform DBSCAN clustering from vector array or distance matrix.

Link to the citation

So I tried the following with scikit-learn:

import pickle
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

with open("data/sparse_matrix", "rb") as f:
    sparse_matrix = pickle.load(f)
X = StandardScaler().fit_transform(sparse_matrix)
db = DBSCAN(eps=1, min_samples=20).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print(labels)

I based this on the DBSCAN example from scikit-learn. I have two questions. The first is: "Is my matrix well formatted and suitable for this algorithm?" I'm concerned about this because of the number of dimensions. The second is: "How do I set the epsilon parameter (the minimal distance between my points)?"

Has QUIT--Anony-Mousse · Accepted Answer · 2016-04-19 11:52:46Z


See the DBSCAN article for a suggestion on how to choose epsilon based on the k-distance graph.
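
A minimal sketch of the k-distance idea, assuming scikit-learn's NearestNeighbors is available. The helper name k_distances, the toy data and the choice k=4 are made up for illustration, not taken from the question. It computes, for every point, the distance to its k-th nearest neighbor and sorts those values in descending order:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distances(X, k, metric="cosine"):
    """Distance from each point to its k-th nearest neighbor, sorted descending.

    Hypothetical helper for illustration only.
    """
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    # Calling kneighbors() without arguments queries the training points
    # themselves and does not count a point as its own neighbor.
    distances, _ = nn.kneighbors()
    return np.sort(distances[:, -1])[::-1]


# Toy binary "user x movie" data (an assumption, standing in for the real matrix).
rng = np.random.RandomState(0)
X = (rng.rand(200, 30) < 0.2).astype(float)

kd = k_distances(X, k=4)
```

Plotting kd against its index (for example with matplotlib) and looking for the first pronounced bend gives a candidate for eps. Since scikit-learn's min_samples counts the point itself, k = min_samples - 1 is a reasonable starting choice, though that is a rule of thumb rather than a guarantee.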

Since your data is sparse, it is probably more appropriate to use e.g. cosine distance rather than Euclidean distance. You should also use a sparse format. As far as I know, numpy.zeros creates a dense matrix:

 sparse_matrix = numpy.zeros(...)

is therefore misleading, because it is a dense matrix, just with mostly 0s.
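
As a rough sketch of what a truly sparse input could look like, the following builds a scipy.sparse CSR matrix and passes it to DBSCAN with metric="cosine". The likes dictionary, the user and movie ids, and the eps and min_samples values are invented toy values, not the data from the question:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Hypothetical toy input: user -> list of liked movie ids.
likes = {
    "u1": [10, 20, 30],
    "u2": [10, 20, 30],
    "u3": [10, 20],
    "u4": [40, 50],
    "u5": [40, 50],
    "u6": [40, 50, 60],
    "u7": [70],
}

users = sorted(likes)
movies = sorted({m for liked in likes.values() for m in liked})
movie_col = {m: j for j, m in enumerate(movies)}  # movie id -> column index

# Collect the coordinates of the non-zero cells only.
rows, cols = [], []
for i, user in enumerate(users):
    for m in likes[user]:
        rows.append(i)
        cols.append(movie_col[m])

# Only the liked (user, movie) pairs are stored; everything else is implicit 0.
X = csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(users), len(movies)),
)

# eps is a cosine distance here, so it lies between 0 and 1 for non-negative data.
db = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit(X)
labels = db.labels_
```

On this toy data the users with overlapping tastes end up in two clusters and u7, who shares no movie with anyone, is labelled as noise (-1). The eps value still has to be tuned for real data.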


2 Comments

mel Over a year ago

Thank you for your response. But if I run my algorithm as is, will I get a reasonably coherent result, or will not using a sparse matrix and cosine distance make my results completely useless?

Has QUIT--Anony-Mousse Over a year ago

Maybe it works, maybe not. StandardScaler is also meant for continuous dense data. It won't crash, but you will probably get better results with a cosine-based approach. And you don't know epsilon yet, so you will have to do something more...
