I would like to implement a function which does the following:

Receives a labeled dataset and splits the datapoints according to label

Args:
    X (np.ndarray): The dataset
    y (np.ndarray): The label for each point in the dataset

Returns:
    List[np.ndarray]: A list of arrays where the elements of each array
    are datapoints belonging to the label at that index.
    
Example:
>>> get_clusters(
...     np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]]),
...     np.array([0, 1, 0])
... )
[array([[0.8, 0.7], [0.3, 0.1]]),
 array([[0. , 0.4]])]

I'm a bit lost: I can't find a way to write into a particular index of the NumPy array, so I can only append to the array as a whole, rather than appending to the sub-array at index 0 when a datapoint has label 0.

Here is my current code:

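import numpy as np

def get_clusters(X, y):
    labels = {}  # label -> index of its first occurrence in y
    clusters = np.array([])

    for i, a in enumerate(y):
        if a not in labels:
            labels[a] = i
        # Without an axis argument, np.append flattens its inputs, so this
        # builds a single flat 1-D array instead of one array per label.
        clusters = np.append(clusters, X[i])

    return clusters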

Can anybody help me with implementing the function? Thank you!

1 Answer

You can use:

def get_clusters(X, y):
    return [X[np.where(y==i)] for i in range(np.amax(y)+1)]

Here, np.amax(y)+1 is the length of the returned list, assuming the labels are the integers from 0 to the maximum value in y (this can be changed if necessary). For each i, np.where(y==i) finds the indices of the points with label i, and those rows are then selected from X. Because the loop runs over i in increasing order, the array at index i of the list holds the datapoints with label i.
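
As a quick check, here is a minimal self-contained sketch of the same idea run against the example from the question. It uses a boolean mask, X[y == i], which selects the same rows as X[np.where(y == i)]. It assumes y holds non-negative integer labels.

```python
import numpy as np

def get_clusters(X, y):
    # Assumes the labels in y are non-negative integers in 0..max(y).
    # X[y == i] uses a boolean mask to pick the rows whose label is i.
    return [X[y == i] for i in range(np.amax(y) + 1)]

X = np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]])
y = np.array([0, 1, 0])

clusters = get_clusters(X, y)
print(clusters)
```

Note that a label in 0..max(y) that never occurs in y yields an empty array at that index, so the list position still matches the label.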

2 Comments

While your (elegant, upvoted) solution works with the test case provided, what if the labels are sparse, or even non-numeric ('A', 'C', 'Q')? I'd suggest ... for i in sorted(set(y))
@gboffi Thanks, I agree with you, but I think that would not satisfy the OP's specification: "A list of arrays where the elements of each array are datapoints belonging to the label at that index.". Each array's index in the list would no longer be its label. So I assumed that the data is such that the issues you mentioned wouldn't arise.
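
If sparse or non-numeric labels do need to be handled, one possible sketch is to key the groups by label instead of by list position. This departs from the OP's stated return type (a dict rather than a list), and the name get_clusters_by_label is hypothetical, used here only for illustration.

```python
import numpy as np

def get_clusters_by_label(X, y):
    # Hypothetical helper: returns a dict rather than a list, so each
    # label (sparse integers, strings, ...) maps directly to its rows.
    # np.unique returns the sorted distinct labels; .item() converts each
    # NumPy scalar to a plain Python value for use as a dict key.
    return {label.item(): X[y == label] for label in np.unique(y)}

X = np.array([[0.8, 0.7], [0, 0.4], [0.3, 0.1]])
y = np.array(['A', 'Q', 'A'])

groups = get_clusters_by_label(X, y)
print(groups['A'])
```

A dict keeps the label-to-group association explicit, which a plain list can only do when the labels are exactly 0, 1, 2, and so on.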
