
I have a set of data clustered into k groups, and each cluster has a minimum size constraint of m.

I've done some reclustering of the data, so now I have a set of points, each of which has one or more better clusters it could move to, but which cannot be switched individually because that would violate the size constraint.

Goal: minimize the sum of distances from each point to its cluster center.

Subject to: Minimum cluster size m

I want to find an algorithm that reassigns points without violating the constraint while guaranteeing a decrease in the objective.

I thought of using a graph to represent the pairwise relationships between points, but I'm not sure how to do the reassignment, since there is the possibility of a big dense cycle, and I got lost swapping multiple points between multiple clusters.

I also created a list of cluster pairs with possible swap candidates, but still couldn't find a way to improve the objective.

I hope I've explained my situation clearly. I'm new to algorithms and not familiar with the jargon and conventions, so if any other information is needed, please let me know.

I've done a lot of research. I tried out the algorithm in the paper Clustering with Size Constraints, but without success, since the sum of membership degrees doesn't necessarily correlate with cluster size.

I've also read other similar posts on SO, but didn't find a detailed algorithm I could implement.

I've tried constructing a weighted directed graph, with vertices representing clusters, an edge from A to B representing the points in cluster A that want to relocate to cluster B, and the edge weight being the number of such points.

But with my data, it turns out that all the nodes end up in one huge cycle with very dense edges. Because of my limited experience, I still can't figure out how to reassign points among so many clusters. Any suggestions are appreciated!
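For concreteness, here is a minimal sketch of the kind of construction I mean (networkx is used purely for illustration, and here I keep one edge per point rather than aggregating them; points is an (n, d) array, labels the current assignment, and centers the (k, d) cluster centers):

import numpy as np
import networkx as nx

def build_relocation_graph(points, labels, centers):
    # Nodes are clusters; one edge A -> B per point currently in A that is
    # strictly closer to B's center, weighted by the distance it would save.
    k = len(centers)
    G = nx.MultiDiGraph()
    G.add_nodes_from(range(k))
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)  # (n, k)
    for i, a in enumerate(labels):
        for b in range(k):
            if b != a and dists[i, b] < dists[i, a]:
                G.add_edge(a, b, point=i, gain=dists[i, a] - dists[i, b])
    return G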


  • Have a look at this over at CrossValidated. Commented May 7, 2015 at 22:03
  • I actually tried out the algorithm in the paper. Maybe I did something wrong, but somehow I didn't get the desired result, since the sum of memberships is not necessarily related to the cluster size. For example, with u_{1} = [0.45, 0.15, 0.4], u_{2} = [0.45, 0.3, 0.25], and u_{3} = [0.1, 0.55, 0.35], the clusters are still unbalanced. Commented May 7, 2015 at 22:13
  • "size constraints" is too unspecific. Do you mean as in stackoverflow.com/questions/5452576/… stackoverflow.com/questions/8796682/…? Commented May 7, 2015 at 22:25

3 Answers


To get a minimal (unfortunately not minimum) solution:

  1. First, greedily recluster any points that you can without violating the minimum size constraint.
  2. Next, build a directed multigraph as follows:
    1. Every cluster becomes a node.
    2. An edge (A,B) is added for every point in A that is closer to the center of B (so that there are potentially multiple edges between the same pair of nodes); its weight should be proportional to the benefit gained by moving it.
  3. Looking for cycles in this graph will allow you to find moves (where a move consists of moving one point along every edge of the cycle).
  4. Pick the cycle with the highest total weight and reassign the points corresponding to its edges.
  5. Repeat steps 1-4 until there are no more cycles.

Creating the graph has a complexity of O(kn), where k is the number of clusters and n the total number of points, and can create the same number of multiedges. Tarjan's algorithm has a complexity of O(k²), assuming that you skip multiedges to the same destination in the DFS. Every time you eliminate a cycle, you decrease the global distance by some amount and remove at least one edge from the graph, so the total time of the algorithm cannot exceed O(k⁴m²). That is quite extravagant; I'm sure there are heuristics (and probably a more detailed analysis) that would tighten this bound.
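For concreteness, here is a rough, untested sketch of steps 2-5 (my own illustration: it uses networkx's find_cycle instead of Tarjan, keeps only the best candidate point per cluster pair instead of the full multigraph, and applies the first cycle found rather than the heaviest one). points is an (n, d) array, labels an (n,) integer array, and centers a (k, d) array:

import numpy as np
import networkx as nx

def improve_by_cycles(points, labels, centers):
    # Each round builds the move graph, applies one cycle of moves (which keeps
    # every cluster's size unchanged), and recomputes the centers.
    labels, centers = labels.copy(), centers.copy()
    k = len(centers)
    while True:
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        G = nx.DiGraph()
        G.add_nodes_from(range(k))
        for i, a in enumerate(labels):
            for b in range(k):
                gain = dists[i, a] - dists[i, b]
                # keep only the single best candidate point per (A, B) pair
                if b != a and gain > 0 and (not G.has_edge(a, b) or gain > G[a][b]["gain"]):
                    G.add_edge(a, b, gain=gain, point=i)
        try:
            cycle = nx.find_cycle(G)   # every edge has positive gain, so any cycle helps
        except nx.NetworkXNoCycle:
            return labels, centers
        for a, b in cycle:             # move one point along every edge of the cycle
            labels[G[a][b]["point"]] = b
        for c in range(k):             # recompute the centers after the simultaneous moves
            centers[c] = points[labels == c].mean(axis=0)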


13 Comments

Sorry, but I doubt this is correct. The existence of a cycle does not imply a potential move. Even if edges from both A to B and B to A exist, moving a point from A to B might destroy the other edge.
@AmiTavory You are thinking of doing one operation and then the other. Instead, think of unclustering p in A and q in B, and then reclustering them (one to A and one to B); if p was closer to B and q was closer to A before the unclustering, the global distance will be minimized if p goes to B and q goes to A. You have to consider the entire cycle as a move, not the individual reclusterings.
Then imagine an isosceles triangle, with A and B forming the base and C the apex, and suppose there is a cycle ABC, where the AB edge is caused by a top-right element of A having an affinity for B. However, its move to B shrinks A, so I don't see why it would follow that it doesn't destroy CA.
@AmiTavory It will certainly destroy CA. The global objective would be reduced even more if you then did not move any point from C to A. However, that violates the hard constraint that A must have some minimum size, so you must pick the least harmful point in B or C and move it to A; because there was previously a cycle, we know that this least damaging point is not the point that just left A.
Sorry, I don't think so. 1. It will not certainly destroy CA, only possibly. More importantly, 2. There is no reason to assume that any such transfer would necessarily violate the minimal size requirement of one of the clusters.

This problem is addressed in this paper:

Bradley, P. S., K. P. Bennett, and A. Demiriz. "Constrained k-means clustering." Microsoft Research, Redmond (2000): 1-8.

We propose explicitly adding $k$ constraints to the underlying clustering optimization problem requiring that cluster $h$ contain at least $\tau_h$ points.

I have an implementation of the algorithm in Python.
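The paper solves the constrained assignment step as a minimum-cost flow problem. As a rough sketch of just that subproblem (my own illustration with networkx, not the implementation mentioned above, and using a single common minimum instead of per-cluster values of τ_h for simplicity), assuming points and centers are numpy arrays and min_size is the required minimum:

import numpy as np
import networkx as nx

def constrained_assignment(points, centers, min_size):
    # Assign every point to a cluster so that each cluster gets at least
    # min_size points while the total point-to-center distance is minimized.
    # Feasible only when n >= k * min_size.
    n, k = len(points), len(centers)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    G = nx.DiGraph()
    for i in range(n):
        G.add_node(("p", i), demand=-1)            # each point supplies one unit of flow
    for h in range(k):
        G.add_node(("c", h), demand=min_size)      # each cluster must absorb min_size units
    G.add_node("sink", demand=n - k * min_size)    # the remaining points can go anywhere
    for i in range(n):
        for h in range(k):
            # integer weights: networkx's min-cost flow prefers integral costs
            G.add_edge(("p", i), ("c", h), capacity=1,
                       weight=int(round(1000 * dists[i, h])))
    for h in range(k):
        G.add_edge(("c", h), "sink", weight=0)     # uncapacitated overflow edge
    flow = nx.min_cost_flow(G)
    return np.array([next(h for h in range(k) if flow[("p", i)][("c", h)] > 0)
                     for i in range(n)])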

Comments


Try this: pip install k-means-constrained and then

from k_means_constrained import KMeansConstrained
KMeansConstrained(n_clusters=8, size_min=None, size_max=None, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=False, random_state=None, copy_x=True, n_jobs=1)

sources:

https://pypi.org/project/k-means-constrained/

https://joshlk.github.io/k-means-constrained/
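A minimal usage sketch with synthetic data (assuming the package is installed; the minimum size and the data below are just examples):

import numpy as np
from k_means_constrained import KMeansConstrained

X = np.random.rand(200, 2)        # 200 toy points in 2-D

clf = KMeansConstrained(n_clusters=8, size_min=10, random_state=0)
labels = clf.fit_predict(X)

print(np.bincount(labels))        # every cluster should contain at least 10 points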

1 Comment

You need to specify the minimum cluster size using the size_min parameter. For example, if the minimum size is 10 and the input data is X: clf = KMeansConstrained(n_clusters=8, size_min=10); labels = clf.fit_predict(X)
