
I have an array of floats like this:

[1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200]

Now, I want to partition the array like this:

[[1.91, 2.87, 3.61] , [10.91, 11.91, 12.82] , [100.73, 100.71, 101.89] , [200]]

// [200] will be considered an outlier because of its low cluster support

I have to find this kind of segmentation for several arrays, and I don't know what the partition size should be. I tried doing it with hierarchical (agglomerative) clustering, and it gives satisfactory results. However, the issue is that I was advised not to use clustering algorithms for one-dimensional problems, since there is no theoretical justification for doing so (whereas there is for multidimensional data).

I spent a lot of time looking for a solution, but the suggestions vary widely: this and this vs. this and this and this.

I found another suggestion besides clustering, namely natural breaks optimization. However, like k-means, it also requires declaring the number of partitions (right?).

It is quite confusing (especially because I have to perform this kind of segmentation on several arrays, and it is impossible to know the optimal number of partitions in advance).

Is there any way to find such partitions (minimizing the variance within partitions and maximizing the variance between them) with some theoretical justification?

Any pointers to articles/papers (with C/C++/Java implementations, if available) that provide some theoretical justification would be very helpful.

  • I am curious as for why clustering does not fit for one dimensional data - what if you somehow increase the dimensionality, e.g., add sqrt(n) as a dimension, a bit like what happens in SVMs? Commented Jul 5, 2013 at 1:53
  • @ZiyaoWei, "why clustering does not fit for one dimensional data" - honestly, I don't know. I was told in class that it is crazy to use clustering on 1-d data, but I found no article stating why I can't (or can). Commented Jul 5, 2013 at 1:56
  • 1
    @ZiyaoWei increasing dimention without reason does not seems a good solution. Commented Jul 5, 2013 at 1:57
  • No, it is not; I was just thinking that there's no real difference between one-dimensional and multidimensional data. Or is there? Commented Jul 5, 2013 at 1:58
  • "...reduce the variance within partitions and maximize the variance between partitions..." If you tell us exactly what you mean by that, maybe we can help. Do you mean minimize ((average variance within a partition) - (average variance between partitions)), or what? Commented Jul 5, 2013 at 1:59

2 Answers


I think I'd sort the data (if it's not already), then take adjacent differences. Divide each difference by the larger of the two numbers it lies between to get a relative change. Set a threshold, and when the change exceeds that threshold, start a new "cluster".

Edit: Quick demo code in C++:

#include <iostream>
#include <vector>
#include <algorithm>
#include <iterator>
#include <numeric>
#include <functional>

int main() {
    std::vector<double> data{ 
        1.91, 2.87, 3.61, 10.91, 11.91, 12.82, 100.73, 100.71, 101.89, 200 
    };

    // sort the input data
    std::sort(data.begin(), data.end());

    // compute the difference between each number and its predecessor
    // (note: std::adjacent_difference copies the first element unchanged)
    std::vector<double> diffs;
    std::adjacent_difference(data.begin(), data.end(), std::back_inserter(diffs));

    // convert differences to relative changes by dividing each difference
    // by the value it leads up to (the larger of the two, since the data
    // is sorted)
    std::transform(diffs.begin(), diffs.end(), data.begin(), diffs.begin(),
        std::divides<double>());

    // print out the results
    for (std::size_t i = 0; i < data.size(); i++) {

        // if a relative change exceeds 40%, start a new group
        // (skip i == 0: diffs[0] is just data[0] itself)
        if (i != 0 && diffs[i] > 0.4)
            std::cout << "\n";

        // print out an item:
        std::cout << data[i] << "\t";
    }

    return 0;
}

Result:

1.91    2.87    3.61
10.91   11.91   12.82
100.71  100.73  101.89
200

3 Comments

Can you kindly elaborate on this? I don't get it (maybe in pseudocode, if possible)?
I tried with larger sample. Looks like it doesn't work [78, 89, 74, 42, 89, 22, 48, 26, 28, 92, 100, 96, 35, 5, 70, 76, 11, 70, 12, 91, 7, 38, 19, 68, 58, 2, 89, 20, 30, 81, 95, 11, 97, 81, 86, 43, 52, 48, 71, 91, 4, 64, 94, 41, 82, 16, 35, 13, 57, 50]
@deep_rugs: I think you've misunderstood the intent. When your data is sorted, there's only one break because there's no place in your data where the change between one number and the next exceeds 40%. If you care about changes with the data in its original order, remove the std::sort line and change if (diffs[i] > 0.4) to if (std::abs(diffs[i]) > 0.4).

Clustering usually assumes multidimensional data.

If you have one dimensional data, sort it, and then use either kernel density estimation, or just scan for the largest gaps.

In 1 dimension, the problem gets substantially easier, because the data can be sorted. If you use a clustering algorithm, it will unfortunately not exploit this, so use a 1 dimensional method instead!

Consider finding the largest gap in 1 dimensional data. It's trivial: sort (O(n log n), but in practice about as fast as it gets), then scan adjacent pairs of values for the largest difference.

Now try defining "largest gap" in 2 dimensions, and an efficient algorithm to locate it...

