K-Means Clustering Algorithm implementation

Question

I'm studying machine learning and want to implement k-means clustering to understand it better. I have a dataset of cats, each with 4 measurements. I want to partition them into 2 and 3 hard clusters based on these measurements to see whether their breeds can be determined from them.

I am familiar with the general algorithm, but I am struggling to get my head around two things:

What range should I draw random numbers from to initialise my centroids? 1 to 10? Does it really matter?
How do I find the Euclidean distance between an (x, y) tuple and a cat (which has 4 properties)?

Generally, in Euclidean distance you compare the x, y, etc. values against each other, but if I have 4 properties, how can I measure how far a cat is from an (x, y) pair on a 2D plane? It doesn't make much sense to me, even after reading up on the concept.

I believe that on a 2D plane I can only look at two of the 4 properties. Or is this not correct? Without reducing the data to 2 dimensions, I don't see how one could do that.

PS: I know there are libraries that implement k-means clustering; that is not the point.

nikhilbalwani · Accepted Answer · 2019-11-14 19:51:13Z

What you're describing in your question is the Euclidean distance between two points in a 2D plane. In your case each cat's property vector is itself a data point, and it has more than two components, so it cannot be placed on a 2D plane. Instead, you work in an n-dimensional space, where each data point is an n-dimensional vector and each dimension represents a feature. Here n is 4, since you have 4 features (properties) per data point. The centroids are 4-dimensional vectors too, not (x, y) pairs.

You can randomize the centroids by choosing, for each one, any vector whose value in each feature lies between the minimum and the maximum of that feature across all the feature vectors.
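As a rough illustration of that initialisation, here is a minimal pure-Python sketch. The function names (`feature_ranges`, `random_centroids`) and the `seed` parameter are made up for this example, and the three vectors are the sample cats used later in this answer. It draws each centroid coordinate uniformly from the observed range of the corresponding feature, which is only one of several reasonable ways to initialise.

```python
import random


def feature_ranges(points):
    """Return (mins, maxs): the per-feature minimum and maximum over all points."""
    columns = list(zip(*points))  # transpose: one tuple per feature
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return mins, maxs


def random_centroids(points, k, seed=None):
    """Draw k centroids, each coordinate uniform within that feature's range."""
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    mins, maxs = feature_ranges(points)
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(mins, maxs)]
        for _ in range(k)
    ]


# Sample feature vectors (one per cat, 4 features each).
cats = [[1, 5, 9, 10], [2, 3, 4, 3], [5, 6, 1, 5]]
mins, maxs = feature_ranges(cats)
centroids = random_centroids(cats, k=2, seed=0)
```

Each centroid produced this way has the same number of components as a data point, so it can be compared against any cat directly.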

Let's say you have 3 different cats with the following properties: [1, 5, 9, 10], [2, 3, 4, 3], [5, 6, 1, 5]. These are simply feature vectors. You would run the clustering as follows:

You begin by computing the per-feature min and max vectors: min = [1, 3, 1, 3] and max = [5, 6, 9, 10]. So you pick each centroid's coordinates from the following ranges: [1...5, 3...6, 1...9, 3...10].
Once the centroids are initialized (either randomly or based on heuristic estimates), you run the algorithm: assign each point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it, and repeat on each iteration.
You calculate the distance between a point and a centroid as the Euclidean distance between two vectors:

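$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

where q_i is the i-th element of vector q, p_i is the i-th element of vector p, and n is the number of features (4 in your case).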
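To make the distance and the assign/recompute loop concrete, here is a minimal pure-Python sketch. The helper names (`euclidean`, `assign`, `update`, `kmeans`), the `max_iter` parameter, and the choice to start from the first two cats as centroids are all assumptions made for illustration. Leaving an empty cluster's centroid where it is is one simple convention among several.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two vectors of equal length (any dimension)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: euclidean(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recompute each centroid as the per-feature mean of its assigned points."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centroids.append(list(old))  # empty cluster: keep old centroid
        else:
            new_centroids.append(
                [sum(col) / len(members) for col in zip(*members)]
            )
    return new_centroids


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids


cats = [[1, 5, 9, 10], [2, 3, 4, 3], [5, 6, 1, 5]]
# Illustrative starting centroids: the first two cats themselves.
labels, centroids = kmeans(cats, [cats[0], cats[1]])
```

Nothing in `euclidean` depends on the vectors having two components, so the same function works for 2, 4, or any number of features.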

Hope it helped!
