
I have a set of data clustered into k groups, and each cluster has a minimum size constraint of m.

I've done some reclustering of the data, so now I have a set of points, each of which has one or more better clusters it could move to, but which cannot be switched individually because that would violate the size constraint.

Goal: minimize the sum of distances from each point to its cluster center.

Subject to: Minimum cluster size m

I want to find an algorithm that reassigns points without violating the constraint while guaranteeing a decrease in the objective.

I thought of using a graph to represent the pairwise relationships between points, but I'm not sure how to do the reassignment, since there is the possibility of a big dense cycle, and I got lost swapping multiple points between multiple clusters.

I also created a list of cluster pairs with possible swap candidates, but still couldn't find a way to improve the objective.

I hope I've explained my situation clearly. I'm new to algorithms and not familiar with the jargon and conventions, so if any other information is needed, please let me know.

I've done a lot of research. I tried out the algorithm in the paper Clustering with Size Constraints, but without success, since the sum of membership degrees doesn't necessarily correlate with cluster size.

I've also read other similar posts on SO, but didn't find a detailed algorithm I could implement.

I've tried constructing a weighted directed graph, with vertices representing clusters, an edge from A to B representing the points in cluster A that want to relocate to cluster B, and the edge weight being the number of such points.

But with my data, it turns out that all the nodes end up in one huge cycle with very dense edges. Because of my limited experience, I still can't figure out how to reassign points among so many clusters. Any suggestions are appreciated!
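For concreteness, here is a minimal sketch of the kind of construction I mean (networkx is used purely for illustration, and here I keep one edge per point rather than aggregating them; points is an (n, d) array, labels the current assignment, and centers the (k, d) cluster centers):

import numpy as np
import networkx as nx

def build_relocation_graph(points, labels, centers):
    # Nodes are clusters; one edge A -> B per point currently in A that is
    # strictly closer to B's center, weighted by the distance it would save.
    k = len(centers)
    G = nx.MultiDiGraph()
    G.add_nodes_from(range(k))
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    for i, a in enumerate(labels):
        for b in range(k):
            if b != a and dists[i, b] < dists[i, a]:
                G.add_edge(a, b, point=i, gain=dists[i, a] - dists[i, b])
    return G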


  • Have a look at this over at CrossValidated. Commented May 7, 2015 at 22:03
  • I actually tried out the algorithm in the paper. Maybe I did something wrong, but somehow I didn't get the desired result, since the sum of memberships is not necessarily related to the cluster size. For example, with u_{1} = [0.45, 0.15, 0.4], u_{2} = [0.45, 0.3, 0.25], and u_{3} = [0.1, 0.55, 0.35], the clusters are still unbalanced. Commented May 7, 2015 at 22:13
  • "size constraints" is too unspecific. Do you mean as in stackoverflow.com/questions/5452576/… stackoverflow.com/questions/8796682/…? Commented May 7, 2015 at 22:25

3 Answers


To get a minimal (unfortunately not minimum) solution:

  1. First, greedily recluster any points that you can without violating the minimum size constraint.
  2. Next, build a directed multigraph as follows:
    1. Every cluster becomes a node.
    2. An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
  3. Looking for cycles in this graph will allow you to find moves (where a move consists of moving one point along every edge of the cycle).
  4. Pick the cycle with the highest total weight and reassign the points corresponding to its edges.
  5. Repeat steps 1-4 until there are no more cycles.

Creating the graph has a complexity of O(kn), where k is the number of clusters and n the total number of points, and can create the same number of multiedges. Tarjan's algorithm has a complexity of O(k²), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k⁴m²). That is quite extravagant; I'm sure there are heuristics (and probably a more detailed analysis) that would tighten this bound.
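For concreteness, here is a rough, untested sketch of steps 2-5 (my own illustration: it uses networkx's find_cycle instead of Tarjan, keeps only the best candidate point per cluster pair instead of the full multigraph, and applies the first cycle found rather than the heaviest one). points is an (n, d) array, labels an (n,) integer array, and centers a (k, d) array:

import numpy as np
import networkx as nx

def improve_by_cycles(points, labels, centers):
    # Each round builds the move graph, applies one cycle of moves (which keeps
    # every cluster's size unchanged), and recomputes the centers.
    labels, centers = labels.copy(), centers.copy()
    k = len(centers)
    while True:
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        G = nx.DiGraph()
        G.add_nodes_from(range(k))
        for i, a in enumerate(labels):
            for b in range(k):
                gain = dists[i, a] - dists[i, b]
                # keep only the single best candidate point per (A, B) pair
                if b != a and gain > 0 and (not G.has_edge(a, b) or gain > G[a][b]["gain"]):
                    G.add_edge(a, b, gain=gain, point=i)
        try:
            cycle = nx.find_cycle(G)   # every edge has positive gain, so any cycle helps
        except nx.NetworkXNoCycle:
            return labels, centers
        for a, b in cycle:             # move one point along every edge of the cycle
            labels[G[a][b]["point"]] = b
        for c in range(k):             # recompute the centers after the simultaneous moves
            centers[c] = points[labels == c].mean(axis=0)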


13 Comments

Sorry, but I doubt this is correct. The existence of a cycle does not imply a potential move. Even if edges from both A to B and B to A exist, moving a point from A to B might destroy the other edge.
@AmiTavory You are thinking of doing one operation and then the other. Instead, think of unclustering p in A and q in B, and then reclustering them (one to A and one to B); if p was closer to B and q was closer to A before the unclustering, the global distance will be minimized if p goes to B and q goes to A. You have to consider the entire cycle as a move, not the individual reclusterings.
Then imagine an isosceles triangle, with A and B forming the base and C the apex, and suppose there is a cycle ABC, where the AB edge is caused by a top-right element of A having an affinity for B. However, its move to B shrinks A, so I don't see why it would follow that it doesn't destroy CA.
@AmiTavory It will certainly destroy CA. The global objective would be reduced even more if you then did not move any point from C to A. However, that violates the hard constraint that A must have some minimum size, so you must pick the least harmful point in B or C and move it to A; because there was previously a cycle, we know that this least damaging point is not the point that just left A.
Sorry, I don't think so. 1. It will not certainly destroy CA, only possibly. More importantly, 2. There is no reason to assume that any such transfer would necessarily violate the minimal size requirement of one of the clusters.

This problem is addressed in this paper:

Bradley, P. S., K. P. Bennett, and A. Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.

We propose explicitly adding $k$ constraints to the underlying clustering optimization problem requiring that cluster $h$ contain at least $\tau_h$ points.

I have an implementation of the algorithm in Python.
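The paper solves the constrained assignment step as a minimum-cost flow problem. As a rough sketch of just that subproblem (my own illustration with networkx, not the implementation mentioned above, and using a single common minimum instead of per-cluster values of τ_h for simplicity), assuming points and centers are numpy arrays and min_size is the required minimum:

import numpy as np
import networkx as nx

def constrained_assignment(points, centers, min_size):
    # Assign every point to a cluster so that each cluster gets at least
    # min_size points while the total point-to-center distance is minimized.
    # Feasible only when n >= k * min_size.
    n, k = len(points), len(centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    G = nx.DiGraph()
    for i in range(n):
        G.add_node(("p", i), demand=-1)            # each point supplies one unit of flow
    for h in range(k):
        G.add_node(("c", h), demand=min_size)      # each cluster must absorb min_size units
    G.add_node("sink", demand=n - k * min_size)    # the remaining points can go anywhere
    for i in range(n):
        for h in range(k):
            # integer weights: networkx's min-cost flow prefers integral costs
            G.add_edge(("p", i), ("c", h), capacity=1,
                       weight=int(round(1000 * dists[i, h])))
    for h in range(k):
        G.add_edge(("c", h), "sink", weight=0)     # uncapacitated overflow edge
    flow = nx.min_cost_flow(G)
    return np.array([next(h for h in range(k) if flow[("p", i)][("c", h)] > 0)
                     for i in range(n)])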

Comments


Try this: pip install k-means-constrained and then

from k_means_constrained import KMeansConstrained
KMeansConstrained(n_clusters=8, size_min=None, size_max=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=False, random_state=None, copy_x=True, n_jobs=1)

sources:

https://pypi.org/project/k-means-constrained/

https://joshlk.github.io/k-means-constrained/
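A minimal usage sketch with synthetic data (assuming the package is installed; the minimum size and the data below are just examples):

import numpy as np
from k_means_constrained import KMeansConstrained

X = np.random.rand(200, 2)        # 200 toy points in 2-D

clf = KMeansConstrained(n_clusters=8, size_min=10, random_state=0)
labels = clf.fit_predict(X)

print(np.bincount(labels))        # every cluster should contain at least 10 points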

1 Comment

You need to specify the minimum cluster size using the size_min parameter. For example, if the minimum size is 10 and the input data is X: clf = KMeansConstrained(n_clusters=8, size_min=10); labels = clf.fit_predict(X)
