
I mused about how hard it would be to create an Excel Lambda function to deliver K-means clusters. I built a one-dimensional version of it with only a small investment of time.

One-dimensional versions are, by themselves, quite useful in solving clustering problems. For example, I used a 1D version to examine the journey data of people walking between buildings on a campus (using access-control data). The management of this company suspected that when employees walked between buildings while on duty, they were using the journeys to "go shopping", "take long breaks", etc., and I thought I could prove to what extent this really happened. The data quickly exposed interesting details - e.g., don't try to change buildings just after lunch, because the elevators are full. It also exposed "clusters" of transit times:

  • Simple transit from A to B
  • Taking a cigarette or coffee break while transiting
  • Having lunch while transiting
  • Going shopping or otherwise AWOL

This confirmed the managers' suspicions and put measurements on them. Unfortunately, it also revealed at least one major reason this facility had the lowest productivity in the world.

In another 1D case, I clustered invoice amounts to build a kind of classifier - more like binning for a histogram.

So, here is my LAMBDA-based, non-VBA, one-dimensional K-means function (it must be defined under the name KMEANS_1D, since it calls itself recursively):

=LAMBDA( dataArray, nodes, [changeThreshold], [kArray],
  LET(
      nSeq, SEQUENCE(, nodes),
      n, IF(ISOMITTED(kArray), nSeq / (nodes + 1), kArray),
      adjust, LAMBDA(array, nArray,
               BYCOL(
                     IF(BYROW(ABS(array - nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) = nSeq, array, ""),
                     LAMBDA(x, AVERAGE(x))
                     )
                ),
     next_kArray, adjust(dataArray, n),
     IF(
         SUM(ABS(next_kArray - n)) <= IF(ISOMITTED(changeThreshold), 0.01, changeThreshold),
         n,
         KMEANS_1D( dataArray, nodes, changeThreshold, next_kArray )
        )
     )
)

Where the arguments are:

  1. dataArray - the one-dimensional (single-column) data that you want to cluster.
  2. nodes - the number of cluster nodes (centroids) you want.
  3. changeThreshold - optional minimum aggregate change in node values at which iteration stops and the result is declared converged (defaults to 0.01).
  4. kArray - optional array used internally for the recursion; omit it when calling.

NOTES: This has no error handling, so user input errors, data containing errors, convergence faults, etc. result in #VALUE!. It only takes vertically presented data from a single column. If you want it to take tabular, row-wise data, change it; but if I put in all of those manipulations, the algorithm would get lost in the code, and this is only for illustration. Also, I chose not to make a random initial selection of nodes because I feared the function would become volatile, so I used a simple even distribution instead (nSeq / (nodes + 1)).
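Since the Lambda is recursive and dense, here is a rough Python sketch of the same 1D algorithm for anyone who wants to trace the logic outside Excel. One small liberty: the even initial seeding is scaled to the data's min-max range here, whereas the Lambda uses the unscaled fractions nSeq / (nodes + 1); the empty-node fallback is also an addition (the Lambda itself would error there, as noted).

```python
def kmeans_1d(data, nodes, change_threshold=0.01):
    lo, hi = min(data), max(data)
    # even initial seeding, like nSeq / (nodes + 1), but scaled to the data range
    centroids = [lo + (hi - lo) * (i + 1) / (nodes + 1) for i in range(nodes)]
    while True:
        # assign each value to its nearest centroid (the MATCH(MIN(...)) step)
        clusters = [[] for _ in range(nodes)]
        for x in data:
            clusters[min(range(nodes), key=lambda i: abs(x - centroids[i]))].append(x)
        # average each cluster (the BYCOL ... AVERAGE step)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # stop when the aggregate movement falls below the threshold
        if sum(abs(a - b) for a, b in zip(new_centroids, centroids)) <= change_threshold:
            return new_centroids
        centroids = new_centroids
```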

Now, the challenge is to make a multi-dimensional version, which has proven not to be so trivial. I will offer mine as an answer and ask whether there are improvements or alternatives.

I will not mark a correct answer - this is just for sharing and collaboration.

2 Answers


Here is the result:

=LAMBDA( dataArray, nodes, [changeThreshold], [kArray],
    LET(
         nSeq, SEQUENCE(, nodes),
         dimensions, COLUMNS(dataArray),
         dSeq, SEQUENCE(, dimensions, 0),
         rSeq, SEQUENCE(ROWS(dataArray)),
         nCoordSeq, SEQUENCE(, nodes * dimensions, 0),
         genStartK, LAMBDA(minDimensions, maxDimensions,
             LET(
                  incr, 2 * PI() / (nodes - 1),
                  center, VSTACK(SIGN(dSeq + 1) * 0.5),
                  ring, COS(DROP(TRANSPOSE(nSeq), -1) * incr + PI() * (dSeq + 1) / dimensions) * SQRT(2) / 4 + 0.5,
                  kMatrix, VSTACK(center, ring) * (maxDimensions - minDimensions) + minDimensions,
                  TOROW(kMatrix)
                  )
          ),
       reshape, LAMBDA(matrix, newCols, WRAPROWS(TOROW(matrix), newCols)),
       distance, LAMBDA(da, ka,
           LET(
                  squareDiff, (INDEX(da, rSeq, MOD(nCoordSeq, dimensions) + 1) - ka) ^ 2,
                  offsetAdd, BYROW(reshape(squareDiff, dimensions), LAMBDA(a, SUM(a))),
                  SQRT(reshape(offsetAdd, nodes))
              )
          ),
       adjust, LAMBDA(array, nArray,
              BYCOL(
                  IF(
                      BYROW(distance(array, nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) =
                          INT(nCoordSeq / dimensions + 1),
                      INDEX(array, rSeq, MOD(nCoordSeq, dimensions) + 1)
                  ),
                  LAMBDA(x, AVERAGE(x))
              )
          ),
        n, IF(
              ISOMITTED(kArray),
              genStartK(BYCOL(dataArray, LAMBDA(a, MIN(a))), BYCOL(dataArray, LAMBDA(a, MAX(a)))),
              TOROW(kArray)
          ),
        next_kArray, IFERROR(adjust(dataArray, n), n),
        IF(
              SUM(ABS(next_kArray - n)) <= IF(ISOMITTED(changeThreshold), 0.01, changeThreshold),
              WRAPROWS(n, dimensions),
              KMEANS(dataArray, nodes, changeThreshold, next_kArray)
          )
      )
)

Inputs

The same arguments are used as in the 1D version. The input array has each row representing a data point and each column representing a dimension, e.g.:

example data input

In the example above, dataArray would be the range B2:D5. Then the number of output nodes needs to be set (an integer).

Outputs

The output is a set of multidimensional nodes that cluster the data, where each row represents a node and each column represents a dimension, e.g.:

example output

The outputs are in cells I2:K5, based on a three-dimensional (three-column) input array and 4 nodes requested for the output.
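For reference while reading the Lambda, here is a loose pure-Python rendering of the multi-dimensional recursion. It is a sketch, not a port: the ring-shaped genStartK seeding is replaced by simple even spacing per dimension, and an empty node keeps its prior centroid, mirroring the IFERROR fallback discussed in the notes.

```python
from math import dist

def kmeans(data, nodes, change_threshold=0.01):
    """Sketch of the multi-dimensional recursion; data is a list of
    equal-length tuples (rows = points, columns = dimensions)."""
    dims = len(data[0])
    lo = [min(p[d] for p in data) for d in range(dims)]
    hi = [max(p[d] for p in data) for d in range(dims)]
    # simplified seeding: even spacing per dimension (stands in for genStartK)
    centroids = [[lo[d] + (hi[d] - lo[d]) * (k + 1) / (nodes + 1) for d in range(dims)]
                 for k in range(nodes)]
    while True:
        # assign every point to its nearest centroid (Euclidean distance)
        clusters = [[] for _ in range(nodes)]
        for p in data:
            clusters[min(range(nodes), key=lambda j: dist(p, centroids[j]))].append(p)
        # average per cluster; an empty node keeps its old position (IFERROR fallback)
        new_centroids = [[sum(p[d] for p in c) / len(c) for d in range(dims)] if c
                         else centroids[k] for k, c in enumerate(clusters)]
        movement = sum(abs(a - b) for nc, oc in zip(new_centroids, centroids)
                       for a, b in zip(nc, oc))
        if movement <= change_threshold:
            return new_centroids
        centroids = new_centroids
```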

Example

Here is a two-dimensional example to illustrate the function. Create two columns that represent the x and y of your sample data. In the formulas below, SD is a named cell holding the standard deviation to use for the artificial clusters. Put this formula into the x column:

=LET( r, RANDARRAY( 100 ),
  m, CHOOSE( ROUNDUP( 3*r, 0 ), 0.25, 0.5, 0.75 ),
  NORM.INV( r, m, SD ) )

and this formula into the y column (where C3# refers to the spilled range of the x column):

=LET( r, RANDARRAY( 100 ),
  m, CHOOSE( ROUNDUP( 2*C3#, 0 ), 0.75, 0.25 ),
  IFERROR( NORM.INV( r, m, SD ), SD*RAND() ) )

This creates some artificial 2-D clusters.
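Equivalently, outside Excel, artificial clusters like these can be generated with a few lines of Python. This is only a sketch: the cluster centres below are illustrative, and sd plays the role of the SD named cell.

```python
import random

def make_clusters(n=100, sd=0.05, seed=42):
    """Roughly mirrors the RANDARRAY / NORM.INV trick: pick one of a few
    cluster centres per point, then add normal noise around it."""
    random.seed(seed)
    centres = [(0.25, 0.75), (0.5, 0.25), (0.75, 0.75)]  # illustrative centres
    points = []
    for _ in range(n):
        cx, cy = random.choice(centres)
        points.append((random.gauss(cx, sd), random.gauss(cy, sd)))
    return points
```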

Now apply the KMEANS function to the data with four nodes, and here are the results: 2D example results

Notes

A multi-dimensional (even just two-dimensional) version is an order of magnitude more difficult. With each problem encountered in debugging, I found a solution, but worried the whole time that this clustering function was becoming a cluster of a function (pun intended).

For example, the simple distance calculation ABS(dataArray - nArray) must be changed into a Pythagorean distance of the form SQRT( SUM( (dataArray - nArray)^2 ) ), with the squared differences summed across the dimensions. It must also accommodate the layout of each array (e.g., are the dimensions presented column-wise, row-wise, in separate symmetrical tables, ...). I ultimately created an internal lambda function, for readability, that can be edited without changing the other parts:

  distance, LAMBDA(da, ka,
      LET(
          squareDiff, (INDEX(da, rSeq, MOD(nCoordSeq, dimensions) + 1) - ka) ^ 2,
          offsetAdd, BYROW(reshape(squareDiff, dimensions), LAMBDA(a, SUM(a))),
          SQRT(reshape(offsetAdd, nodes))
      )
  ),

While it is a simple calculation, it required a lot of structuring to do it across dimensions. I am unhappy with the distance function, because the only way I could find to do it was to use my own variant of WRAPROWS, called reshape, to build an array called offsetAdd that facilitates BYROW summation. This is then reshaped back into columnar form to make a table of distances in which columns represent nodes and rows represent the data points in dataArray.

The reshape function is there for readability.

reshape, LAMBDA(matrix, newCols, WRAPROWS(TOROW(matrix), newCols)),
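For comparison, in a language with loops, the distance table that all this reshaping builds (rows = data points, columns = nodes) is a short nested comprehension. This hypothetical helper may make the Excel version easier to trace:

```python
from math import dist  # Euclidean distance between two sequences

def distance_table(data, centroids):
    # rows = data points, columns = nodes, values = Euclidean distances
    return [[dist(point, c) for c in centroids] for point in data]
```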

The adjust internal function required substantial modification to adapt to multi-dimensionality.

adjust, LAMBDA(array, nArray,
    BYCOL(
        IF(
            BYROW(distance(array, nArray), LAMBDA(x, MATCH(MIN(x), x, 0))) =
                INT(nCoordSeq / dimensions + 1),
            INDEX(array, rSeq, MOD(nCoordSeq, dimensions) + 1)
           ),
        LAMBDA(x, AVERAGE(x))
       )
  )

No Error Handling

I put no effort into error handling or adaptive inputs. This is already large, and it would be hard to understand if it were bloated with input shaping and error handling. There is one exception: I wrapped the adjust call in an IFERROR because I noticed that if, by chance, one of the nodes is not selected by any data points during the adjustment iterations, it throws a #DIV/0! error. This could be handled a number of ways - including my preference, which is to do nothing. If you get this result, it is likely because you have too many nodes for the sample data (say, 3 tightly defined clusters when you are asking for 5). The simplest (but most brute-force) solution was to replace an error with the prior node value (n). It works...

next_kArray, IFERROR(adjust(dataArray, n), n)
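In procedural terms, the IFERROR fallback amounts to "keep the previous centroid when a cluster comes up empty". A hypothetical Python rendering of just that step (the function name safe_adjust is mine, not the Lambda's):

```python
def safe_adjust(clusters, old_centroids):
    """Mirror of the IFERROR wrapper: if a node attracts no points in this
    iteration, keep its previous position instead of failing on a
    division by zero."""
    new = []
    for c, old in zip(clusters, old_centroids):
        if c:
            dims = len(old)
            new.append(tuple(sum(p[d] for p in c) / len(c) for d in range(dims)))
        else:
            new.append(old)  # the "keep the prior node value" choice
    return new
```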

Improvements Wanted

If you have the time to dive into this and see ways to improve it, I would love to learn from you. I have hopefully made it modular enough to allow surgical improvements to parts.

Also, sorry that there are no code comments. For some reason, that feature is not working anymore in the Advanced Formula Environment.



I'm kind of unfamiliar with K-means and tried to understand it by doing a little research. I used REDUCE to iterate:

=LET(xy,C3:D102,
     k,3,
REDUCE(TAKE(xy,k),SEQUENCE(10),
LAMBDA(r,i,
       LET(t,BYROW(xy,
             LAMBDA(b,
                    LET(m,BYROW(((r-b)^2),
                          LAMBDA(n,SUM(n)))^0.5,
                                 XMATCH(MIN(m),m)))),
           MAP(SEQUENCE(k)+SEQUENCE(,COLUMNS(xy))*0,SEQUENCE(,COLUMNS(xy))+SEQUENCE(k)*0,
           LAMBDA(x,y,
                  AVERAGE(CHOOSECOLS(FILTER(xy,t=x),y))))))))

It first takes the first k rows (I figured there's no need for a volatile random choice); this is the starting point.

From there on, it computes the distance from each row in your range to the current k starting-point rows, column-wise (differences squared, summed, then raised to the power of one half). Each row is assigned to its nearest k-value, and all values nearest a given k-value are averaged to become the new k-values.

I iterated 10 times.
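The fixed-iteration approach maps directly onto functools.reduce, with the sequence element ignored exactly as i is in the REDUCE formula. A sketch, seeding from the first k rows as this answer does:

```python
from functools import reduce
from math import dist

def kmeans_reduce(data, k, iterations=10):
    def step(centroids, _i):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in data:
            clusters[min(range(k), key=lambda j: dist(p, centroids[j]))].append(p)
        # then average each cluster (keeping the old centroid if one is empty)
        return [tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[j]
                for j, c in enumerate(clusters)]
    # seed with the first k rows, as the formula does with TAKE(xy, k)
    return reduce(step, range(iterations), [tuple(p) for p in data[:k]])
```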

4 Comments

Wow!!! What a brilliant solution! TBH, this whole idea came to me because k-means requires recursion (in theory); you avoided recursion with simple iteration. The REDUCE is so far out of the box. You always create ingenious solutions. I tested it side by side and it generates similar results - sometimes nearly exactly the same. I ran distance measures and found it produces better-fitting results 50% of the time (i.e., it is effectively the same), so 10 iterations cause no issues (in my tests). I think it would not take much to make it recursive in any case.
Thanks for the compliments. But most of all, good to see you on here. It was your creativity in answers, amongst others, that made me have a wider view at Excel’s formula capabilities.
Hey P.b. - that's really kind and means a lot to me. Honestly, I like your clean and streamlined approaches that give new ways of thinking about a problem. I had a number of career and family changes which pulled me away from spending time on SO - which has always been a source of enjoyment. It will get better, but not for a while longer. I have amused myself with other ideas for recursive LAMBDA and gotten some working, but not had enough time to post them. Your approach above is such a revelation that I am now rethinking them. Perhaps on a quiet day, I will give it a try.
one more point: I learned K-means long after I had a real-world use case for it, so I invented my own method at that time. It was for an optimization problem that no longer exists because the Internet wiped out the need. However, I took a modified version of your formula and applied it with some success. I will post it as well, but it is purely entertainment. The idea is to assign a cost to each K node and then have the LAMBDA function find the optimal solution. Also, your method can easily be applied to what I would call XPMT (adapts to a given payment schedule).
