
I am running the DBSCAN algorithm in Python on a dataset (modelled very similarly to http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html and loaded as a pandas DataFrame) that has a total of ~3 million data points across 31 days. I do density clustering to find outliers on a per-day basis, so db = DBSCAN(eps=0.3, min_samples=10).fit(data) only has one day's worth of data points to run on in each pass. The minimum/maximum number of points I have on any day is 15,809 and 182,416. I tried deleting the variables, but the process still gets killed at the DBSCAN clustering stage.

  1. At O(n log n) this obviously bloats up, no matter where I run it. I understand there is no way to pre-specify the number of "labels", or clusters - so what is the best approach here?

  2. Also, from an optimization point of view, some of these data points will be exact duplicates (think of them as cluster points that are repeated) - can I use this information to preprocess the data before feeding it to DBSCAN?

  3. I read this thread on using "canopy preclustering" to compress your data as in vector quantization ahead of DBSCAN (Note this method is equally expensive computationally) - can I use something similar to pre-process my data? Or how about "parallel DBSCAN"?
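Regarding point 2: scikit-learn's DBSCAN accepts a `sample_weight` argument, so exact duplicates can be collapsed into unique rows with counts before clustering, which shrinks the distance computation without changing the result. A minimal sketch on made-up data (the array shapes and repeat factor here are illustrative assumptions, not your actual dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical one-day slice: an (n, 2) array where many points are exact repeats.
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
data = np.repeat(base, 3, axis=0)  # each point appears 3 times -> 1500 rows

# Collapse exact duplicates, keeping how often each unique point occurred.
unique_pts, counts = np.unique(data, axis=0, return_counts=True)

# A point with sample_weight w counts as w points toward min_samples,
# so clustering the deduplicated set is equivalent to clustering the full set.
db = DBSCAN(eps=0.3, min_samples=10).fit(unique_pts, sample_weight=counts)
labels = db.labels_  # one label per unique point; -1 marks noise/outliers

print(unique_pts.shape[0], "unique rows instead of", data.shape[0])
```

If duplicates are frequent in your data, this can cut both memory and runtime substantially, since DBSCAN's cost is driven by the number of distinct rows it must compare.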

1 Answer


Have you considered:

  • partitioning - cluster one day (or less) at a time
  • sampling - break your data set randomly into 10 parts and process them individually

1 Comment

How would you then join those individual partitions?
