I'm new to scikit-learn and I'm trying to cluster people by their taste in movies. I built a sparse matrix with one column per movie and one row per user. Each cell is 1 if the user liked the movie and 0 otherwise.

sparse_matrix = numpy.zeros(shape=(len(list_user), len(list_movie)))
for id in list_user:
    index_id = list_user.index(id)
    for movie in list_movie[index_id]:
        if movie.isdigit():
            index_movie = list_movie.index(int(movie))
            sparse_matrix[index_id][index_movie] = 1
# Pickle data is binary, so open the file in binary mode and close it when done
with open("data/sparse_matrix", "wb") as f:
    pickle.dump(sparse_matrix, f)
return sparse_matrix

I treat this as an array of vectors, which according to the documentation is an acceptable input:

Perform DBSCAN clustering from vector array or distance matrix.

Link to the citation

So I tried the following with scikit-learn:

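import pickle
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Pickle data is binary, so read the file in binary mode
with open("data/sparse_matrix", "rb") as f:
    sparse_matrix = pickle.load(f)
X = StandardScaler().fit_transform(sparse_matrix)
db = DBSCAN(eps=1, min_samples=20).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
print(labels)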

I based this on the DBSCAN example from scikit-learn. I have two questions. First, is my matrix well formatted and suitable for this algorithm? I'm concerned about the number of dimensions. Second, how should I set the epsilon parameter (the maximum distance between two points for them to count as neighbors)?

1 Answer

See the DBSCAN article for a suggestion on how to choose epsilon based on the k-distance graph.
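As a rough illustration of the k-distance idea, the sketch below computes, for every point, the distance to its k-th nearest neighbor and sorts those distances in descending order. The helper name k_distance_curve, the toy matrix, and the choice k=4 are assumptions made for this example only; they are not taken from the question. A value of epsilon is usually read off near the "knee" of the resulting curve.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def k_distance_curve(X, k, metric="euclidean"):
    """Distance from each point to its k-th nearest neighbor, sorted descending.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    nn = NearestNeighbors(n_neighbors=k, metric=metric).fit(X)
    # Calling kneighbors() without arguments queries the training points
    # themselves and does not count a point as its own neighbor.
    distances, _ = nn.kneighbors()
    return np.sort(distances[:, -1])[::-1]


# Toy stand-in for the user/movie matrix: 200 users, 30 movies, about 20% likes.
rng = np.random.RandomState(0)
toy = (rng.rand(200, 30) < 0.2).astype(float)
toy[:, 0] = 1.0  # make sure no user has an all-zero row

curve = k_distance_curve(toy, k=4)
print(curve[:5])  # largest k-distances; plot the full curve to look for a knee
```

Plotting curve against the point rank (for example with matplotlib) gives the graph to inspect. Passing metric="cosine" would produce the same kind of curve for cosine distance.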

Since your data is sparse, cosine distance is probably more appropriate than Euclidean distance. You should also store the data in a sparse format. numpy.zeros creates a dense array, so the name in:

 sparse_matrix = numpy.zeros(...)

is therefore misleading: the result is a dense matrix that merely contains mostly zeros.
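For comparison, here is a minimal sketch of building a truly sparse matrix with scipy.sparse. The likes dictionary and the user and movie ids in it are invented for the example; the question's actual list_user and list_movie structures are not fully shown, so this does not claim to reproduce them.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical input: for each user, the ids of the movies they liked.
likes = {"alice": [10, 30], "bob": [10], "carol": [20, 30]}

users = sorted(likes)
movies = sorted({m for liked in likes.values() for m in liked})

# Dictionary lookups avoid the repeated list.index() scans.
user_index = {u: i for i, u in enumerate(users)}
movie_index = {m: j for j, m in enumerate(movies)}

rows, cols = [], []
for user, liked in likes.items():
    for movie in liked:
        rows.append(user_index[user])
        cols.append(movie_index[movie])

# Only the nonzero cells are stored: one entry per (user, movie) like.
data = np.ones(len(rows), dtype=np.float64)
X = csr_matrix((data, (rows, cols)), shape=(len(users), len(movies)))

print(X.shape, X.nnz)
```

Only the liked (user, movie) pairs take up memory here, whereas the numpy.zeros version allocates every cell.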

2 Comments

Thank you for your response. If I run my algorithm as is, will I get a reasonably coherent result, or will using a dense matrix and Euclidean rather than cosine distance make my result completely useless?
Maybe it works, maybe not. StandardScaler is also meant for continuous, dense data. It won't crash, but you will probably get better results with a cosine-based approach. And you don't know epsilon yet, so you will have to do some more work...
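A cosine-based variant might look like the sketch below. The toy matrix, eps=0.3 and min_samples=3 are assumptions chosen so that the tiny example forms two obvious groups; they are not recommended values for the real data. StandardScaler is left out here because centering binary data would destroy its sparsity.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Toy user/movie matrix (rows are users, columns are movies).
# Users 0-2 like movies 0-2, users 3-5 like movies 3-5, user 6 fits neither group.
dense = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
], dtype=np.float64)
X = csr_matrix(dense)

# metric="cosine" makes eps a cosine distance (0 = same direction, 1 = orthogonal).
# eps and min_samples are illustrative guesses, not tuned values.
db = DBSCAN(eps=0.3, min_samples=3, metric="cosine").fit(X)
labels = db.labels_

print(labels)  # -1 marks points that DBSCAN treats as noise
```

With cosine distance, two users who liked exactly the same movies are at distance 0 and two users with no movie in common are at distance 1, so a useful eps lies somewhere in between.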
