Traversing a 1D Numpy Array for Clustering

Question

I have a 1-dimensional numpy array where the value for each element, i, points to another index in the array. Each cluster center has a unique negative (integer) value. The goal is to assign each element to a cluster.

Example

# Generated/pre-computed elsewhere
a = np.array([6, 8, 1, -1, 0, 3, -2, 4, -3, 10, 5])

Thus, the cluster centers are elements 3, 6, and 8 (since they have a negative value) and they are labeled as clusters -1, -2, and -3, respectively. So, since

a[0] = 6 --> a[6] = -2,

then a[0] can be assigned as -2. Similarly, since

a[5] = 3 --> a[3] = -1,

then a[5] can be assigned as -1. Following this logic then all elements can be assigned to a cluster center. The resulting array would be:

[-2, -3, -3, -1, -2, -1, -2, -2, -3, -1, -1]

I know how to achieve this on paper but I don't see how to accomplish this in code or using vectorized code in numpy.

Update: Based on unutbu's answer below, I replaced the while loop with a for loop in order to avoid an endless while loop:

a = np.array([6, 8, 1, -1, 0, 3, -2, 4, -3, 10, 5])
for i in range(len(a)):
    mask = a >= 0
    if not mask.any(): break
    a[mask] = a[a[mask]]

Have you tried Anything? If so please post it.

wwii
– wwii

2015-01-21 19:55:05 +00:00
Commented Jan 21, 2015 at 19:55 — wwii
– wwii, Commented Jan 21, 2015 at 19:55
Your process does not seem to work for a[9].

wwii
– wwii

2015-01-21 19:58:10 +00:00
Commented Jan 21, 2015 at 19:58 — wwii
– wwii, Commented Jan 21, 2015 at 19:58
@wwii: a[9] --> a[10] --> a[5] --> a[3] --> -1

slaw
– slaw

2015-01-21 20:16:32 +00:00
Commented Jan 21, 2015 at 20:16 — slaw
– slaw, Commented Jan 21, 2015 at 20:16

unutbu · Accepted Answer · 2015-01-22 14:47:34Z

We don't know a priori how many "hops" it takes to find a cluster center. So we'll have to do some iteration and check if we've landed on a cluster center:

import numpy as np
a = np.array([6, 8, 1, -1, 0, 3, -2, 4, -3, 10, 5])

for i in a:
    mask = a>=0
    # We can stop when all the values in `a` are negative
    if not mask.any(): break
    # perform a hop
    a[mask] = a[a[mask]]

print(a)

yields

[-2 -3 -3 -1 -2 -1 -2 -2 -3 -1 -1]

To see what's going on, it might be clearer to look at a simpler value of a:

a = np.arange(-1, 10)
print(a)
for i in a:
    mask = a>=0
    # We can stop when all the values in `a` are negative
    if not mask.any(): break
    # perform a hop
    a[mask] = a[a[mask]]
    print(a)

prints

[-1  0  1  2  3  4  5  6  7  8  9]
[-1 -1  0  1  2  3  4  5  6  7  8]
[-1 -1 -1 -1  0  1  2  3  4  5  6]
[-1 -1 -1 -1 -1 -1 -1 -1  0  1  2]
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1]

Here we can clearly see that the values at each index (i.e. a column in the output) are decreasing. You might have expected the values to decrease by 1, but in fact they decrease by more than one in subsequent iterations because of the effect of previous hops.

More generally, let G be a sequence of indices that hop to the same cluster value. An invariant of the loop is that values in G map to values in G. Also, at each index, with each iteration the distance (i.e. number of hops) to a cluster value decreases (or is zero). These two facts together imply by the Banach fixed point theorem that the algorithm converges to a fixed point.

Also with each iteration the number of locations filled with cluster values grows exponentially. If before an iteration there are n locations filled with cluster values, then after a hop there will be 2n locations filled with cluster values. So the convergence is exponential.

Following the OP's lead, I changed the while True loop to a for-loop. The fixed point will be found in fewer than len(a) steps, but nevertheless for i in a suffices.

Note that the code above assumes that every index converges to a cluster value. If that is not true, then the original while True loop could loop forever. The for i in a will end, but maybe not at all cluster values. A simple example which illustrates the problem is a = np.array([1,0]).

Interestingly, the "simple example" np.arange(-1, 10) is also the worst-case scenario in terms of maximum iterations required for an array of length 11. Since the number of cluster values grows as 2**n, the loop requires at most on the order of log2(len(a)) iterations. Therefore, we can write a function will either return the cluster values or raise a ValueError when a contains infinite cycles:

def converge(a):
    for i in range(int(np.log2(len(a)))+1):
        mask = a>=0
        # We can stop when all the values in `a` are negative
        if not mask.any(): break
        # perform a hop
        a[mask] = a[a[mask]]
    else:
        # for-loop completed without reaching break
        mask = a>=0
        if mask.any(): 
            raise ValueError('{!r} contains infinite cycles'.format(a))
    return a

Each hop modifies elements/items, When a modified element is referenced in subsequent hops would it be guaranteed to contain the correct value at that point? It seems like it should and works for this example but I just couldn't see my way through it.
@wwii: I've added above some words of explanation. The values referenced in subsequent hops are guaranteed to contain values from the same sequence G.

Collectives™ on Stack Overflow

Traversing a 1D Numpy Array for Clustering

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related