
I have an ordered 1-D array of numbers. Both the array length and the values in it are arbitrary. I want to partition the array into k partitions according to the number values. For example, with 4 partitions distributed as 30% / 30% / 20% / 20%, the first partition holds the first 30% of the values, the next partition the following 30%, and so on. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it must not end up in two different partitions. This means that the distribution percentages above are not strict, but rather "goals" or "starting points", if you wish.

For example, let's say my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8].

I choose k = 4 and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.

Given the constraints I gave above, the resulting partitions should be:

A = [1], B = [5, 5], C = [6, 7], D = [8, 8, 8, 8, 8]

with resulting (achieved/corrected) percentages pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%
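(To be concrete, each corrected percentage is just that partition's share of the array; a quick Python sketch of the check, with the example partitions hard-coded:)

```python
ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
parts = [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]  # A, B, C, D from above

# Achieved percentage of each partition = its share of all elements.
achieved = [100 * len(p) / len(ar) for p in parts]
print(achieved)  # [10.0, 20.0, 20.0, 50.0]
```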

It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.

So, is there an algorithm for this kind of clustering?

  • What happens if you specify 4 partitions and have an array [1, 1, 1, 1, 1, 1, 1, 8]? Commented Nov 15, 2011 at 17:00
  • First, you should create some more examples to make the requirements clear. For example, what do you expect for k=4, 25% distribution, when ar=[1,2,3,4,5,6,7,8,9,10]? Commented Nov 15, 2011 at 17:04
  • You will need to define some sort of measure to quantify how close a particular partitioning is to the goal. Without such a measure, you wouldn't know which solution is "best". The naive approach (partition according to the original percentages, then move the partition boundaries to accommodate the constraint) will always give you a solution; you just don't know how good it is. Commented Nov 15, 2011 at 17:23
  • @Femaref I have the same question. The requirements are coming from the clients, who are obviously not that technical. My guess is they will say that the amount of data is large enough to ensure that such a situation will never arise. I realize this is not helping to formulate the algorithm properly. Commented Nov 16, 2011 at 8:53
  • @DocBrown Your example is very simple actually. It would be something like A=[1,2], B=[3,4,5], C=[6,7], D=[8,9,10], or A=[1,2,3], B=[4,5], C=[6,7,8], D=[9,10], but both are acceptable. It depends on how you do your rounding on the division. Commented Nov 16, 2011 at 8:56

3 Answers


Clustering algorithms are used on multi-dimensional data. For one-dimensional data, you should simply use a sorting algorithm.

Sort the data, then partition the data set linearly, working from the bottom of the array to the top, as in your example.
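A minimal Python sketch of that idea (the function name is illustrative; note it ignores the duplicate-value constraint from the question and only shows the sort-then-slice step):

```python
def sort_and_cut(ar, percentages):
    """Sort the array, then cut it at the cumulative percentage marks.

    Ignores the duplicate-value constraint; this only demonstrates the
    basic sort-then-partition-linearly idea.
    """
    data = sorted(ar)
    n = len(data)
    parts, start, cum = [], 0, 0.0
    for p in percentages:
        cum += p
        end = round(cum * n)  # some rounding rule is needed at each mark
        parts.append(data[start:end])
        start = end
    return parts

print(sort_and_cut([7, 1, 6, 5, 5, 8], [0.5, 0.5]))  # [[1, 5, 5], [6, 7, 8]]
```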



Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.

def partitions(a, i, sizes, cache):
    """Find a least-cost partition of a[i:].

    The ideal sizes of the partitions are stored in the tuple 'sizes'
    and cache is used to memoize previously calculated results.
    """
    key = (i, sizes)
    if key in cache: return cache[key]
    if len(sizes) == 1:
        segment = len(a) - i
        result = (segment - sizes[0]) ** 2, (segment,)
        cache[key] = result
        return result
    best_cost, best_partition = None, None
    for j in range(len(a) - i + 1):
        if 0 < j < len(a) - i and a[i + j - 1] == a[i + j]:
            # Avoid breaking a run of one number.
            continue
        bc, bp = partitions(a, i + j, sizes[1:], cache)
        c = (j - sizes[0]) ** 2 + bc
        if best_cost is None or c < best_cost:
            best_cost = c
            best_partition = (j,) + bp
    cache[key] = (best_cost, best_partition)
    return cache[key]


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (len(ar) * 0.25,) * 4  # ideal part sizes: (2.5, 2.5, 2.5, 2.5)
print(partitions(ar, 0, sizes, {}))  # (9.0, (1, 2, 2, 5))

2 Comments

Looks like you're on to something here Paul, thanks. Is this pseudocode or some of the new-fangled languages that I'm not aware of (Scala?) I will have a closer look and get back to you.
It's python: it's not exactly new-fangled, but on a good day it does look like pseudocode.

The naive approach would go like this:

Say p1...pk are the percentages for your partitions (p1+...+pk = 1)

Say you have N elements in the array

The initial boundaries (there are k+1 of them, including the array ends, since you have k partitions) are: 0, p1*N, (p1+p2)*N, ..., N (there'll be some rounding to do).

For moving the boundaries, you look at the two array elements on each side of a boundary (for the k-1 boundaries that you can move). If the two elements are equal, you need to move the boundary, either left or right, at least until the constraint is satisfied. A naive approach would be to start on the left and make minimal adjustments (move each boundary to the side that causes the least movement, and don't move it any further).

This algorithm doesn't cover the whole space of partitions though. It just gives you one solution. To find the best solution, you'd need to do a brute-force search on the entire partition space, with some kind of pruning (e.g. dynamic programming, where you remember the best partitioning for a subarray of the initial array).
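A rough Python sketch of the naive one-pass adjustment described above (the function name is made up; it nudges each cut that splits a run of equal values to the nearer run edge, and makes no attempt at optimality):

```python
def naive_boundaries(a, percentages):
    """Place cuts at the cumulative percentage marks, then nudge any cut
    that splits a run of equal values to the nearer edge of that run.
    One left-to-right pass; not optimal, just the naive heuristic.
    """
    n = len(a)
    cuts, cum = [0], 0.0
    for p in percentages[:-1]:
        cum += p
        cuts.append(round(cum * n))  # Python's round() halves-to-even; any rule works here
    cuts.append(n)
    for i in range(1, len(cuts) - 1):
        b = cuts[i]
        if 0 < b < n and a[b - 1] == a[b]:
            # The cut splits a run of equal values: find both run edges.
            left = right = b
            while left > 0 and a[left - 1] == a[b]:
                left -= 1
            while right < n and a[right] == a[b]:
                right += 1
            # Minimal adjustment: move to the closer edge, without
            # crossing the previous cut.
            b = left if b - left <= right - b else right
            cuts[i] = max(b, cuts[i - 1])
    return [a[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]

ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
print(naive_boundaries(ar, [0.25, 0.25, 0.25, 0.25]))
# → [[1], [5, 5, 6, 7], [8, 8, 8, 8, 8], []]
```

Note the empty last partition: even on the question's own input, the one-pass heuristic misses the (1, 2, 2, 5) split that the dynamic-programming answer finds, which is exactly the limitation raised in the comments below.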

4 Comments

Let's try your algorithm on a scenario like this: ar = [1, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10] with pi = 0.25, k = 4, N = 12. So b0 = 0, b1 = 3, b2 = 6, b3 = 9, b4 = 12. We obviously can't change b0 or b4, so we start from b1 = 3. ar[3] = ar[2] = ar[4] = 9. Do I check left or right? If I go left, I will stop at the 8 at ar[1] and my first boundary will be b1 = 2. If I go right, I will pass the run of 9s and reach the 10 at ar[7], and my first boundary will be b1 = 7.
Clearly, if I go right I will not have an optimal solution, not even close, because I will not be able to continue past b1 and I will end up with only 2 partitions. If I go left, I will have a slightly better partition, but still only 2 partitions. Conversely, in a scenario like ar = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 9, 10] I would have similar problems.
In other words, when the distribution is not uniform, I am not sure this naive approach works. Also, moving a boundary left or right can have a significant impact on the final result, and it seems to me that one needs to be able to backtrack and start all over again in the opposite direction.
Right - your examples are such that there's only one solution that is not even close to the expected percentages. Again, for a full solution you need to explore the space of partitions.
