Wrong Graph Plot using K-Means in Python

Question

This is my first time implementing a Machine Learning Algorithm in Python. I tried implementing K-Means using Python and Sklearn for this dataset.

from sklearn.cluster import KMeans
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Importing the dataset
data = pd.read_csv('dataset.csv')
print("Input Data and Shape")
print(data.shape)
data.head()


# Getting the values and plotting it
f1 = data['Area'].values
f2 = data['perimeter'].values
f3 = data['Compactness'].values
f4 = data['length_kernel'].values
f5 = data['width_kernel'].values
f6 = data['asymmetry'].values
f7 = data['length_kernel_groove'].values



X = np.array(list(zip(f1,f2,f3,f4,f5,f6,f7)))
# Number of clusters
kmeans = KMeans(n_clusters=7)
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_

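# Plot the first two features, coloured by cluster label (cmap has no effect without c)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='rainbow')
# centroids[:, 1] is the second coordinate of every centroid; centroids[:1] would select only the first row
plt.scatter(centroids[:, 0], centroids[:, 1], color="black", marker='*')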
plt.show()

The graph doesn't seem to plot the data correctly. How can I debug this issue?

Comment (JahKnows, Nov 15, 2017 at 8:03): What is the dimensionality (i.e. shape) of the array X?

Kasra Manshaei · Accepted Answer · 2017-11-15 11:31:39Z

Well, there are some issues:

Dimension vs K: Before talking about visualization, I would like to address a clustering concept. Your data has 7 dimensions, but that does not mean it has 7 clusters. Be careful here. For instance, suppose I have two features describing people: salary and years of working experience. There are two features, but does that mean there are necessarily two categories in the data? Certainly not.
Visualization: Your data has 7 dimensions, which cannot be visualized directly. You decided to reduce it to two, which is the correct approach, but you did it the wrong way. You cannot simply take the first two features to visualize 7 dimensions; you need to REDUCE the data to two features using a dimensionality reduction algorithm such as PCA or NMF. What you did actually IGNORES the other 5 dimensions, which are extremely informative for placing the points in a 7-dimensional space.
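To illustrate the first point, here is a minimal sketch on synthetic data (not your dataset). It assumes `make_blobs` as a stand-in: 300 points with 7 features drawn around 3 centers. The sample count, the number of centers, and the random seeds are all made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in (an assumption, not the asker's data):
# 7 features per point, but only 3 underlying groups.
X, _ = make_blobs(n_samples=300, n_features=7, centers=3, random_state=0)

# Fit K-means for K = 1..7 and record the inertia
# (within-cluster sum of squared distances) for each K.
inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)
    print(k, round(km.inertia_, 1))
```

On data like this the inertia usually drops sharply until K reaches the number of real groups and then flattens out. Looking for that bend (often called the elbow heuristic) is one common way to choose K, and it has nothing to do with the number of features.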

Solution

The rest of your approach is fine. Just add PCA to your code like this:

from sklearn.decomposition import PCA

# Project the 7-dimensional data onto its first two principal components
model = PCA(n_components=2)
X_new = model.fit_transform(X)
# ... use X_new instead of X for the K-means procedure

Please note that I wrote this from memory, so it is better to check the documentation in case I made a typo. If you have more questions, you can comment here.
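Putting the pieces together, a fuller sketch might look like the following. It runs on synthetic data from `make_blobs` rather than your CSV, and the helper names `reduce_and_cluster` and `plot_clusters`, the choice of 3 clusters, and the seeds are hypothetical, chosen only for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA


def reduce_and_cluster(X, n_clusters, random_state=0):
    """Reduce X to 2 dimensions with PCA, then run K-means on the reduced data."""
    pca = PCA(n_components=2)
    X_new = pca.fit_transform(X)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(X_new)
    # Centroids are already 2-dimensional because K-means was fitted on X_new.
    return X_new, labels, kmeans.cluster_centers_, pca.explained_variance_ratio_


def plot_clusters(X_new, labels, centroids):
    """Scatter the reduced points coloured by cluster, with centroids as stars."""
    # Imported here so the clustering step does not require matplotlib.
    from matplotlib import pyplot as plt

    plt.scatter(X_new[:, 0], X_new[:, 1], c=labels, cmap='rainbow')
    plt.scatter(centroids[:, 0], centroids[:, 1], color='black', marker='*', s=200)
    plt.xlabel('principal component 1')
    plt.ylabel('principal component 2')
    plt.show()


# Synthetic stand-in for the real data (an assumption): 7 features, 3 groups.
X, _ = make_blobs(n_samples=300, n_features=7, centers=3, random_state=0)
X_new, labels, centroids, ratio = reduce_and_cluster(X, n_clusters=3)

print(X_new.shape, centroids.shape)
# Fraction of the total variance kept by the two components.
print(ratio.sum())
```

Calling `plot_clusters(X_new, labels, centroids)` then draws the figure. The printed variance fraction gives a rough idea of how much of the 7-dimensional structure survives in the 2-D picture; if it is low, the plot can still be misleading.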

Good Luck!
