I'm studying machine learning and want to implement K-means clustering to understand it better. I have a dataset of cats, each with 4 measurements. I want to group them into 2 and then 3 distinct hard clusters based on these properties, to see whether their breeds can be determined from the measurements.
I am familiar with the general algorithm, but I am struggling to get my head around two things:
- What range of numbers should I be randomising between to initialise my centroids? 1 to 10? Does it really matter?
- How do I find the Euclidean distance between an (x, y) tuple and a cat (which has 4 properties)?
Generally, with Euclidean distance you compare the x values, y values, etc. against each other, but if a cat has 4 properties, how can I measure how far it is from an (x, y) pair on a 2D plane? It still doesn't make much sense to me, even after reading up on the concept.
I believe that on a 2D plane I can only look at two of the 4 properties, or is this not correct? Without reducing the data to 2 dimensions, I don't see how one could do that.
PS: I know there are libraries that implement K-means clustering; that is not the point.
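For reference, here is a minimal sketch of how both points are commonly handled, assuming the centroids live in the same 4-dimensional space as the cats rather than on a 2D plane. Under that assumption a centroid is itself a tuple of 4 numbers, and the Euclidean distance just sums the squared differences over all 4 properties instead of 2. The function names (`euclidean_distance`, `init_centroids`, `assign_clusters`) and the sample measurements are made up for illustration, and picking random data points as starting centroids is only one common choice, not the only one.

```python
import math
import random


def euclidean_distance(a, b):
    """Euclidean distance between two points with the same number of dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centroids(points, k, seed=None):
    """Pick k distinct data points as starting centroids.

    Assumption: sampling from the data itself, so no numeric range
    such as 1 to 10 has to be guessed.
    """
    rng = random.Random(seed)
    return rng.sample(points, k)


def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda i: euclidean_distance(p, centroids[i]))
        for p in points
    ]


# Hypothetical cats: 4 made-up measurements each.
cats = [
    (4.0, 30.0, 25.0, 9.0),
    (4.2, 31.0, 24.0, 9.5),
    (6.5, 45.0, 30.0, 12.0),
    (6.8, 46.0, 31.0, 12.5),
]

centroids = init_centroids(cats, k=2, seed=0)
labels = assign_clusters(cats, centroids)
print(centroids)
print(labels)
```

This covers only the initialisation and assignment steps. A full K-means loop would also recompute each centroid as the per-property mean of its assigned cats and repeat until the assignments stop changing. Reducing the data to 2 dimensions would then only be needed for plotting, not for the clustering itself.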