How to output a NumPy array of clusters based on two arrays with data and labels

Question

I would like to implement a function which does the following:

Receives a labeled dataset and splits the datapoints according to label

Args:
    X (np.ndarray): The dataset
    y (np.ndarray): The label for each point in the dataset

Returns:
    List[np.ndarray]: A list of arrays where the elements of each array
    are datapoints belonging to the label at that index.
    
Example:
>>> get_clusters(
...     np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]]),
...     np.array([0, 1, 0])
... )
[array([[0.8, 0.7], [0.3, 0.1]]),
 array([[0. , 0.4]])]

I'm currently a bit lost: I can't find a way to write to a specific index of the NumPy array, so I can only append to the array as a whole, instead of appending to the sub-array at index 0, which should hold the datapoints with label 0.

Here is my current code:

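import numpy as np

def get_clusters(X, y):
    i = 0
    labels = {}  # maps each label to the index where it first appears
    clusters = np.array([])
    for a in y:
        if a not in labels:
            labels[a] = i
        # both branches of the original did the same append into one array
        clusters = np.append(clusters, X[i])
        i += 1
    return clusters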

Can anybody help me with implementing the function? Thank you!

GoodDeeds · Accepted Answer · 2021-04-30 11:50:24Z

You can use:

import numpy as np

def get_clusters(X, y):
    return [X[np.where(y == i)] for i in range(np.amax(y) + 1)]

Here, np.amax(y)+1 is the length of the list, assuming the labels are the integers from 0 to the maximum value in y (this can be changed if necessary). For each i, np.where(y==i) finds the indices of the points with label i, and those rows are then selected from X. Because the loop runs over i in increasing order, the array at index i holds the datapoints with label i.
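As a quick check, here is a minimal, self-contained sketch that runs the function on the example from the question. The second call, with labels 0 and 2 only, is an assumption added for illustration: it shows that a label with no points still gets a slot in the list, holding an empty array.

```python
import numpy as np

def get_clusters(X, y):
    # One array per integer label 0..max(y); np.where(y == i) gives the row indices for label i.
    return [X[np.where(y == i)] for i in range(np.amax(y) + 1)]

X = np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]])
y = np.array([0, 1, 0])

clusters = get_clusters(X, y)
for label, points in enumerate(clusters):
    print(label, points.tolist())
# 0 [[0.8, 0.7], [0.3, 0.1]]
# 1 [[0.0, 0.4]]

# Illustrative assumption: label 1 is missing, so index 1 holds an empty array.
gappy = get_clusters(X, np.array([0, 2, 0]))
print([c.shape for c in gappy])
# [(2, 2), (0, 2), (1, 2)]
```

The list position doubles as the label, which is what the docstring in the question asks for, at the cost of empty entries when labels are skipped.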


2 Comments

gboffi Over a year ago

While your (elegant, upvoted) solution works with the test case provided, what if the labels are sparse, or even non-numeric ('A', 'C', 'Q')? I'd suggest ... for i in sorted(set(y))

GoodDeeds Over a year ago

@gboffi Thanks, I agree with you, but I think that would not satisfy the OP's specification: "A list of arrays where the elements of each array are datapoints belonging to the label at that index." Each array's index in the list would no longer be its label. So I assumed that the data is such that the issues you mentioned wouldn't arise.
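If the labels are sparse or non-numeric, one possible workaround for the trade-off discussed in these comments is to key the clusters by label rather than by list position. The sketch below is only an illustration, not part of the accepted answer: the name get_clusters_by_label is hypothetical, and returning a dict deliberately departs from the List[np.ndarray] return type in the question.

```python
import numpy as np

def get_clusters_by_label(X, y):
    # Hypothetical variant: map each distinct label to the rows of X carrying it.
    # np.unique returns the distinct labels in sorted order; .item() converts
    # each NumPy scalar to a plain Python value for use as a dict key.
    return {label.item(): X[y == label] for label in np.unique(y)}

X = np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]])
y = np.array(['A', 'Q', 'A'])

groups = get_clusters_by_label(X, y)
for label, points in groups.items():
    print(label, points.tolist())
# A [[0.8, 0.7], [0.3, 0.1]]
# Q [[0.0, 0.4]]
```

With a dict there are no empty placeholder entries for unused labels, and each cluster is still retrievable by its label.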
