
I need help (preferably a full algorithm, but any hint or reference will be appreciated) with the following algorithmic problem:

We have a set of N elements. I can define a distance between any two elements that satisfies the metric conditions. I need to group these elements into disjoint subsets (each element belonging to exactly one subset) according to the following rules:

  1. The maximum distance between any two elements in each subset does not exceed a specified threshold.

  2. The number of subsets is as small as possible.

  3. If more than one grouping satisfies conditions (1) and (2), the maximum distance between any two elements in the same subset, taken over all subsets, should be as small as possible.

Example:

Assume we have the following points on a number axis: 1, 11, 12, 13, 23. The distance is simply the absolute difference between the points. Our distance threshold is 10. The two possible groupings satisfying conditions (1) and (2) are: (1, 11), (12), (13, 23) or (1), (11, 12, 13), (23). However, condition (3) says that the latter grouping is the correct one: its maximum intra-subset distance is 2, versus 10 for the former.
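
For concreteness, a quick Python check of the two groupings (the helper function is my own, just for illustration):

    def max_intra_distance(groups):
        # largest pairwise distance found inside any single subset
        return max(max(b - a for a in g for b in g) for g in groups)

    print(max_intra_distance([(1, 11), (12,), (13, 23)]))   # -> 10
    print(max_intra_distance([(1,), (11, 12, 13), (23,)]))  # -> 2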

  • What problems are you running into when trying to do this in the straightforward way? Commented Nov 16, 2017 at 21:46
  • @sascha N can be between 20 and 100. Commented Nov 16, 2017 at 22:35
  • @500 - Internal Server Error, what do you mean by the straightforward way? Commented Nov 16, 2017 at 22:37
  • Just saw it's 1D-only. So while not answering your question, here is a start, at least in regard to what to look for. Commented Nov 16, 2017 at 23:10
  • One approach (as nearly always) would be mixed-integer programming, which should be able to produce good approximations given little time and exact results given more time. But given that no language is specified and an algorithm is asked for, I won't do a Python-based implementation. Commented Nov 16, 2017 at 23:34

1 Answer


For 1-dimensional data, sort the points, divide them into the desired number of bins, and then move the bin boundaries to optimize.
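
To illustrate, here is a minimal Python sketch of that idea, assuming the inputs are numbers on a line: a greedy sweep finds the fewest bins that fit under the threshold (condition 2), then a binary search over the achievable spans tightens the boundaries (condition 3). The function names and the O(n²) candidate-set construction are mine, not part of this answer; the latter is fine for the N between 20 and 100 mentioned in the comments.

    import bisect

    def min_groups(pts, span):
        # greedy sweep over sorted points: open a new group whenever the
        # next point would stretch the current group beyond `span`
        count, i = 0, 0
        while i < len(pts):
            count += 1
            i = bisect.bisect_right(pts, pts[i] + span, i)
        return count

    def partition(points, threshold):
        pts = sorted(points)                      # assumes at least one point
        k = min_groups(pts, threshold)            # condition (2): fewest groups
        # condition (3): smallest span that still allows k groups; the only
        # spans worth testing are the pairwise distances <= threshold
        candidates = sorted({b - a for i, a in enumerate(pts)
                             for b in pts[i:] if b - a <= threshold})
        lo, hi = 0, len(candidates) - 1
        while lo < hi:                            # binary search; min_groups
            mid = (lo + hi) // 2                  # is non-increasing in span
            if min_groups(pts, candidates[mid]) == k:
                hi = mid
            else:
                lo = mid + 1
        best = candidates[lo]
        groups, i = [], 0                         # rebuild with tightened span
        while i < len(pts):
            j = bisect.bisect_right(pts, pts[i] + best, i)
            groups.append(pts[i:j])
            i = j
        return groups

    print(partition([1, 11, 12, 13, 23], 10))     # [[1], [11, 12, 13], [23]]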

It gets more interesting in higher dimensions, where the problem becomes NP-hard, so finding the optimum will be expensive. You can indeed use clustering here: use complete-linkage clustering. For an O(n²) time and O(n) memory approach, try CLINK. But in my experience, you will need to run this algorithm several times, on shuffled data, to get a good solution.
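
As a rough sketch of that recipe, using SciPy's stock complete-linkage routine rather than CLINK specifically (the function name, the number of runs, and the tie-breaking are my own additions): cut the dendrogram at the threshold so every cluster's diameter stays within it, repeat on shuffled copies of the data, and keep the run with the fewest clusters, breaking ties by the smallest diameter per condition (3).

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.cluster.hierarchy import linkage, fcluster

    def best_complete_linkage(points, threshold, runs=20, seed=0):
        # repeated complete-linkage clustering on shuffled data; returns
        # (num_clusters, max_diameter, labels) of the best run found
        points = np.asarray(points)
        rng = np.random.default_rng(seed)
        dmat = squareform(pdist(points))          # full distance matrix
        best = None
        for _ in range(runs):
            order = rng.permutation(len(points))  # shuffle to vary tie-breaking
            Z = linkage(pdist(points[order]), method='complete')
            # cutting at `threshold` caps every cluster's diameter, because
            # complete linkage merges at the maximum inter-cluster distance
            shuffled = fcluster(Z, t=threshold, criterion='distance')
            labels = np.empty(len(points), dtype=int)
            labels[order] = shuffled              # map back to original order
            k = labels.max()
            diam = max(dmat[np.ix_(labels == c, labels == c)].max()
                       for c in range(1, k + 1))  # tie-break: condition (3)
            if best is None or (k, diam) < (best[0], best[1]):
                best = (k, diam, labels)
        return best

Note that this remains a heuristic: the threshold cut guarantees condition (1), but neither the minimal number of clusters nor the smallest diameter is guaranteed, which is why the shuffled restarts help.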
