
I am running the DBSCAN algorithm in Python on a dataset (modelled very similarly to http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas DataFrame) that has a total of ~3 million data points across 31 days. I do density clustering to find outliers on a per-day basis, so db = DBSCAN(eps=0.3, min_samples=10).fit(data) only has one day's worth of data points to run on in each pass. The minimum/maximum number of points I have on any day is 15,809 and 182,416. I tried deleting the variables, but the process still gets killed at the DBSCAN clustering stage.

  1. At O(n log n) this obviously bloats up, no matter where I run it. I understand there is no way to pre-specify the number of "labels", or clusters - so what is the best approach here?

  2. Also, from an optimization point of view, some of these data points will be exact duplicates (think of them as cluster points that are repeated) - can I use this information to preprocess the data before feeding it to DBSCAN?

  3. I read this thread on using "canopy preclustering" to compress your data as in vector quantization ahead of DBSCAN (Note this method is equally expensive computationally) - can I use something similar to pre-process my data? Or how about "parallel DBSCAN"?
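Regarding point 2: scikit-learn's DBSCAN accepts a `sample_weight` argument, so exact duplicates can be collapsed into unique rows with counts before clustering, which shrinks the distance computation without changing the result. A minimal sketch on made-up data (the array shapes and repeat factor here are illustrative assumptions, not your actual dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical one-day slice: an (n, 2) array where many points are exact repeats.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
data = np.repeat(base, 3, axis=0)  # each point appears 3 times -> 1500 rows

# Collapse exact duplicates, keeping how often each unique point occurred.
unique_pts, counts = np.unique(data, axis=0, return_counts=True)

# A point with sample_weight w counts as w points toward min_samples,
# so clustering the deduplicated set is equivalent to clustering the full set.
db = DBSCAN(eps=0.3, min_samples=10).fit(unique_pts, sample_weight=counts)
labels = db.labels_  # one label per unique point; -1 marks noise/outliers

print(unique_pts.shape[0], "unique rows instead of", data.shape[0])
```

If duplicates are frequent in your data, this can cut both memory and runtime substantially, since DBSCAN's cost is driven by the number of distinct rows it must compare.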

1 Answer


Have you considered:

  • partitioning - cluster one day (or less) at a time
  • sampling - break your data set randomly into 10 parts and process them individually

1 Comment

How would you then join those individual partitions?
