
I have an ordered 1-D array of numbers. Both the array length and the values in it are arbitrary. I want to partition the array into k partitions according to the number values. For example, with 4 partitions distributed as 30% / 30% / 20% / 20%, the first partition holds the first 30% of the values, the next partition the following 30%, and so on. I get to choose k and the percentages of the distribution. In addition, if the same number appears more than once in the array, it must not end up in two different partitions. This means that the distribution percentages above are not strict, but rather "goals" or "starting points", if you wish.

For example, let's say my array is ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8].

I choose k = 4 and the numbers should be distributed into partitions A, B, C and D with percentages pA = pB = pC = pD = 25%.

Given the constraints I gave above, the resulting partitions should be:

A = [1], B = [5, 5], C = [6, 7], D = [8, 8, 8, 8, 8]

with resulting (achieved/corrected) percentages pcA = 10%, pcB = 20%, pcC = 20%, pcD = 50%
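(To be concrete, each corrected percentage is just that partition's share of the array; a quick Python sketch of the check, with the example partitions hard-coded:)

```python
ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
parts = [[1], [5, 5], [6, 7], [8, 8, 8, 8, 8]]  # A, B, C, D from above

# Achieved percentage of each partition = its share of all elements.
achieved = [100 * len(p) / len(ar) for p in parts]
print(achieved)  # [10.0, 20.0, 20.0, 50.0]
```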

It seems to me that I need a modified k-means algorithm, because the standard algorithm is not guaranteed to respect my percentages and/or the requirement that the same value cannot be in more than one cluster/partition.

So, is there an algorithm for this kind of clustering?

  • What happens if you specify 4 partitions and have an array [1, 1, 1, 1, 1, 1, 1, 8]? Commented Nov 15, 2011 at 17:00
  • First, you should create some more examples to make the requirements clear. For example, what do you expect for k=4, 25% distribution, when ar=[1,2,3,4,5,6,7,8,9,10]? Commented Nov 15, 2011 at 17:04
  • You will need to define some sort of measure to quantify how close a particular partitioning is to the goal. Without such a measure, you wouldn't know which solution is "best". The naive approach (partition according to the original percentages, then move the partition boundaries to accommodate the constraint) will always give you a solution; you just don't know how good it is. Commented Nov 15, 2011 at 17:23
  • @Femaref I have the same question. The requirements are coming from the clients, who are obviously not that technical. My guess is they will say that the amount of data is large enough to ensure that such a situation will never arise. I realize this is not helping to formulate the algorithm properly. Commented Nov 16, 2011 at 8:53
  • @DocBrown Your example is very simple actually. It would be something like A=[1,2], B=[3,4,5], C=[6,7], D=[8,9,10], or A=[1,2,3], B=[4,5], C=[6,7,8], D=[9,10], but both are acceptable. It depends on how you do your rounding on the division. Commented Nov 16, 2011 at 8:56

3 Answers


Clustering algorithms are used on multi-dimensional data. For one-dimensional data, you should simply use a sorting algorithm.

Sort the data, then partition the data set linearly, working from the bottom of the array to the top, as in your example.
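A minimal Python sketch of that idea (the function name is illustrative; note it ignores the duplicate-value constraint from the question and only shows the sort-then-slice step):

```python
def sort_and_cut(ar, percentages):
    """Sort the array, then cut it at the cumulative percentage marks.

    Ignores the duplicate-value constraint; this only demonstrates the
    basic sort-then-partition-linearly idea.
    """
    data = sorted(ar)
    n = len(data)
    parts, start, cum = [], 0, 0.0
    for p in percentages:
        cum += p
        end = round(cum * n)  # some rounding rule is needed at each mark
        parts.append(data[start:end])
        start = end
    return parts

print(sort_and_cut([7, 1, 6, 5, 5, 8], [0.5, 0.5]))  # [[1, 5, 5], [6, 7, 8]]
```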



Here's a dynamic programming solution that finds a partition that minimizes the sum of squares of the errors in the sizes of the parts. So in your example of [1, 5, 5, 6, 7, 8, 8, 8, 8, 8], you want parts of size (2.5, 2.5, 2.5, 2.5) and the result given by this code is (9.0, (1, 2, 2, 5)). That means the partitions chosen were of size 1, 2, 2 and 5, and the total error is 9 = (2.5-1)^2 + (2.5-2)^2 + (2.5-2)^2 + (2.5-5)^2.

def partitions(a, i, sizes, cache):
    """Find a least-cost partition of a[i:].

    The ideal sizes of the partitions are stored in the tuple 'sizes'
    and cache is used to memoize previously calculated results.
    """
    key = (i, sizes)
    if key in cache: return cache[key]
    if len(sizes) == 1:
        segment = len(a) - i
        result = (segment - sizes[0]) ** 2, (segment,)
        cache[key] = result
        return result
    best_cost, best_partition = None, None
    for j in range(len(a) - i + 1):
        if 0 < j < len(a) - i and a[i + j - 1] == a[i + j]:
            # Avoid breaking a run of one number.
            continue
        bc, bp = partitions(a, i + j, sizes[1:], cache)
        c = (j - sizes[0]) ** 2 + bc
        if best_cost is None or c < best_cost:
            best_cost = c
            best_partition = (j,) + bp
    cache[key] = (best_cost, best_partition)
    return cache[key]


ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
sizes = (len(ar) * 0.25,) * 4  # ideal part sizes: (2.5, 2.5, 2.5, 2.5)
print(partitions(ar, 0, sizes, {}))  # (9.0, (1, 2, 2, 5))

2 Comments

Looks like you're on to something here Paul, thanks. Is this pseudocode or some of the new-fangled languages that I'm not aware of (Scala?) I will have a closer look and get back to you.
It's python: it's not exactly new-fangled, but on a good day it does look like pseudocode.

The naive approach would go like this:

Say p1...pk are the percentages for your partitions (p1+...+pk = 1)

Say you have N elements in the array

The initial boundaries (there are k+1 of them, including the array ends, since you have k partitions) are: 0, p1*N, (p1+p2)*N, ..., N (there'll be some rounding to do).

For moving the boundaries, you look at the two array elements on each side of a boundary (for the k-1 boundaries that you can move). If the two elements are equal, you need to move the boundary, either left or right, at least until the constraint is satisfied. A naive approach would be to start on the left and make minimal adjustments (move each boundary to the side that causes the least movement, and don't move it any further).

This algorithm doesn't cover the whole space of partitions though. It just gives you one solution. To find the best solution, you'd need to do a brute-force search on the entire partition space, with some kind of pruning (e.g. dynamic programming, where you remember the best partitioning for a subarray of the initial array).
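A rough Python sketch of the naive one-pass adjustment described above (the function name is made up; it nudges each cut that splits a run of equal values to the nearer run edge, and makes no attempt at optimality):

```python
def naive_boundaries(a, percentages):
    """Place cuts at the cumulative percentage marks, then nudge any cut
    that splits a run of equal values to the nearer edge of that run.
    One left-to-right pass; not optimal, just the naive heuristic.
    """
    n = len(a)
    cuts, cum = [0], 0.0
    for p in percentages[:-1]:
        cum += p
        cuts.append(round(cum * n))  # Python's round() halves-to-even; any rule works here
    cuts.append(n)
    for i in range(1, len(cuts) - 1):
        b = cuts[i]
        if 0 < b < n and a[b - 1] == a[b]:
            # The cut splits a run of equal values: find both run edges.
            left = right = b
            while left > 0 and a[left - 1] == a[b]:
                left -= 1
            while right < n and a[right] == a[b]:
                right += 1
            # Minimal adjustment: move to the closer edge, without
            # crossing the previous cut.
            b = left if b - left <= right - b else right
            cuts[i] = max(b, cuts[i - 1])
    return [a[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]

ar = [1, 5, 5, 6, 7, 8, 8, 8, 8, 8]
print(naive_boundaries(ar, [0.25, 0.25, 0.25, 0.25]))
# → [[1], [5, 5, 6, 7], [8, 8, 8, 8, 8], []]
```

Note the empty last partition: even on the question's own input, the one-pass heuristic misses the (1, 2, 2, 5) split that the dynamic-programming answer finds, which is exactly the limitation raised in the comments below.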

4 Comments

Let's try your algorithm on a scenario like this: ar = [1, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10] with pi = 0.25, k = 4, N = 12. So b0 = 0, b1 = 3, b2 = 6, b3 = 9, b4 = 12. We obviously can't change b0 or b4, so we start from b1 = 3. ar[3] = ar[2] = ar[4] = 9. Do I check left or right? If I go left, I will stop at the 8 at ar[1] and my first boundary will be b1 = 2. If I go right, I will pass the run of 9s and reach the 10 at ar[7], and my first boundary will be b1 = 7.
Clearly, if I go right I will not have an optimal solution, not even close, because I will not be able to continue past b1 and I will end up with only 2 partitions. If I go left, I will have a slightly better partition, but still only 2 partitions. Conversely, in a scenario like ar = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 9, 10] I would have similar problems.
In other words, when the distribution is not uniform, I am not sure this naive approach works. Also, moving a boundary left or right can have a significant impact on the final result, and it seems to me that one needs to be able to backtrack and start all over again in the opposite direction.
Right - your examples are such that there's only one solution that is not even close to the expected percentages. Again, for a full solution you need to explore the space of partitions.
