I'm a beginner in data science and want to cluster my data to visualize its distribution.

This is the current plot. Each point is one observation with an x and a y value.

(image: current visualisation)

I want to get something like the plot below: count all the data points that fall into each cell of a 2D grid and replace them with a single point whose size shows how many data points are in that grid cell.

I'm pretty sure there is a pandas/matplotlib function that will help me, but searching for clustering or grouping turned up nothing helpful.

(image: my goal; the larger the point, the more data points are in that grid cell)

1 Answer

Here is my crack at it. Note that I am no matplotlib wizard or pandas ninja (I am more of an R/ggplot guy), so there are probably easier ways to work with the data in Python/pandas.

import numpy as np
print('numpy: {}'.format(np.__version__))
import matplotlib as mpl
print('matplotlib: {}'.format(mpl.__version__))
import pandas as pd
print('pandas: {}'.format(pd.__version__))
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#  Define the names of the variables as we want them

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris = pd.read_csv(url, names=names)

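# Create the figure first and pass its axes to pandas;
# otherwise DataFrame.plot opens a second figure and this one stays empty
fig, ax = plt.subplots(figsize=(8, 4), dpi=288)
iris.plot(kind='scatter', x="petal-length", y="petal-width", ax=ax)
plt.show()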

d1 = iris.assign(
    petal_length_cut = pd.qcut(iris['petal-length'],5, labels=np.linspace(1,7,5)),
    petal_width_cut = pd.qcut(iris['petal-width'],5, labels=np.linspace(0,2.5,5))
)
d2 = d1.assign(cartesian=pd.Categorical(d1.filter(regex='_cut').apply(tuple, 1)))
d3 = d2[['petal-length', 'petal-width', 'cartesian']]
print(d3)
hist = d3['cartesian'].value_counts()
print(hist)

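# Shift the bin labels by a fixed, hand-tuned offset to place the markers
x = [c[0] + .25 for c in hist.index]
y = [c[1] + .5 for c in hist.index]
# Marker size scales with the count; values are in the same order as hist.index
s = hist.to_numpy() * 10
plt.scatter(x, y, s=s)
plt.show()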

(image: original plot of iris petal length vs. width)
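A side note on the binning used above: `pd.qcut` splits at quantiles, so each bin holds roughly the same number of points and the bin widths vary, while `pd.cut` uses equal-width bins, which is closer to a regular grid. Here is a minimal sketch of the difference; the `values` series is made up for illustration and is not the iris data.

```python
import pandas as pd

# Hypothetical skewed sample, used only to illustrate the two binning functions
values = pd.Series([1, 1, 1, 2, 2, 3, 4, 6, 9, 20])

# Quantile-based bins: roughly equal counts, unequal widths
quantile_bins = pd.qcut(values, 2)
# Equal-width bins: equal widths, counts follow the data
width_bins = pd.cut(values, 2)

# Count the points per bin, ordered from the lowest bin to the highest
quantile_counts = quantile_bins.value_counts().sort_index()
width_counts = width_bins.value_counts().sort_index()

print(quantile_counts)
print(width_counts)
```

For a grid where marker size should reflect local density, equal-width bins are usually what you want, since quantile bins even out the counts by construction.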

I got somewhat better control of the binning and placement by using pd.cut and plotting each marker at the midpoint of its interval:

length_bins = pd.cut(iris['petal-length'],7)
width_bins = pd.cut(iris['petal-width'],5)
bins = pd.DataFrame({"l":length_bins, "w":width_bins})
hist = bins.value_counts()

hist.index = [(i[0].mid, i[1].mid) for i in hist.index]
#print(hist)

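# Each index entry is now a (length midpoint, width midpoint) pair
x = [c[0] for c in hist.index]
y = [c[1] for c in hist.index]
# Marker size scales with the count; values are in the same order as hist.index
s = hist.to_numpy() * 10
plt.xlim([0, 7])
plt.ylim([0, 2.5])
plt.scatter(x, y, s=s)
plt.show()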

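The same count-per-grid-cell idea can probably be written more directly with `numpy.histogram2d`, which bins both axes in one call. This is only a sketch of an alternative, not what the code above does: the helper names `grid_counts` and `plot_grid_counts`, the `scale` parameter, and the random test data are all assumptions made for illustration.

```python
import numpy as np


def grid_counts(x, y, bins=(7, 5)):
    """Count points per 2D grid cell; return centers and counts of non-empty cells."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    # Midpoint of each bin along both axes
    x_mid = (x_edges[:-1] + x_edges[1:]) / 2
    y_mid = (y_edges[:-1] + y_edges[1:]) / 2
    # counts[i, j] belongs to x bin i and y bin j, hence indexing="ij"
    xx, yy = np.meshgrid(x_mid, y_mid, indexing="ij")
    mask = counts > 0
    return xx[mask], yy[mask], counts[mask]


def plot_grid_counts(x, y, bins=(7, 5), scale=10):
    """Draw one marker per non-empty cell, sized by the number of points in it."""
    # Imported here so grid_counts can be used without matplotlib installed
    import matplotlib.pyplot as plt

    cx, cy, n = grid_counts(x, y, bins)
    fig, ax = plt.subplots()
    ax.scatter(cx, cy, s=n * scale)
    return ax


# Hypothetical random data standing in for the real x and y columns
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
cx, cy, n = grid_counts(x, y)
```

With the iris frame from above, `plot_grid_counts(iris['petal-length'], iris['petal-width'])` followed by `plt.show()` should give a plot comparable to the one produced with pd.cut.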
