Using Python to generate clusters of data?

Question

I'm working on a Python function in which I want to model a Gaussian distribution, but I'm stuck.

import numpy.random as rnd
import numpy as np

def genData(co1, co2, M):
  X = rnd.randn(2, 2M + 1)
  t = rnd.randn(1, 2M + 1)
  numpy.concatenate(X, co1)
  numpy.concatenate(X, co2)
  return(X, t)

I'm trying to generate two clusters of size M: cluster 1 is centered at co1 and cluster 2 is centered at co2. X holds the data points I'm going to graph, and t holds the target values (1 for cluster 1, 2 for cluster 2) so I can color the points by cluster.

In that case, t is an array of 2M values, each 1 or 2, and X is size 2M * 1, where t[i] is 1 if X[i] is in cluster 1 and 2 if X[i] is in cluster 2.

I figured the best way to start is to generate the array using NumPy's random module. What I'm confused about is how to center the points on each cluster.

Would the best way be to generate a cluster of size M and then add co1 to each of the points? How would I make it random, though, and make sure t[i] gives each point the right color?

I'm using this function to graph the data:

def graphData():
    co1 = (0.5, -0.5)
    co2 = (-0.5, 0.5)
    M = 1000
    X, t = genData(co1, co2, M)
    colors = np.array(['r', 'b'])
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], color = colors[t], s = 10)

Use numpy.random.multivariate_normal. Give the mean argument as a vector of length 2; that will be the location of the cluster. – Warren Weckesser, Commented Nov 4, 2017 at 20:32
@WarrenWeckesser Thanks Warren, but how will I make it so X is random and t will tell me which cluster it belongs to? – Andrew Raleigh, Commented Nov 4, 2017 at 20:58
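
The follow-up question in the comment above (random X plus a label array t) can be sketched with numpy.random.multivariate_normal as suggested. This is a minimal sketch and not code from the thread: the name gen_data, the std and seed parameters, and the isotropic covariance are assumptions made for illustration. Labels are 0 and 1 rather than 1 and 2 so that they can index a two-element color array directly.

```python
import numpy as np

def gen_data(co1, co2, M, std=0.2, seed=None):
    """Return (X, t): X has shape (2*M, 2); t is 0 for cluster 1 and 1 for cluster 2.

    std and seed are illustrative parameters, not part of the original question.
    """
    rng = np.random.default_rng(seed)
    cov = (std ** 2) * np.eye(2)  # assumed isotropic covariance
    # Draw M points around each center and stack them into one array
    X = np.vstack([
        rng.multivariate_normal(co1, cov, size=M),
        rng.multivariate_normal(co2, cov, size=M),
    ])
    t = np.concatenate([np.zeros(M, dtype=int), np.ones(M, dtype=int)])
    # Shuffle points and labels with the same permutation so they stay aligned
    perm = rng.permutation(2 * M)
    return X[perm], t[perm]

X, t = gen_data((0.5, -0.5), (-0.5, 0.5), M=1000, seed=0)
print(X.shape, t.shape)
print(X[t == 0].mean(axis=0))  # sample mean of cluster 1, should be near co1
```

Because the same permutation is applied to both arrays, t[i] still describes X[i] after shuffling, and an expression such as colors[t] in the plotting function picks one color per point.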

Ganesh Tata · Accepted Answer · 2023-08-29 12:24:53Z

9

For your purpose, I would go for sklearn's sample generator make_blobs:

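import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# One center and one standard deviation per cluster
centers = [(-5, -5), (5, 5)]
cluster_std = [0.8, 1]

# X holds the points; y holds the cluster index (0 or 1) of each row of X
X, y = make_blobs(n_samples=100, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], color="red", s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="blue", s=10, label="Cluster2")
plt.legend()
plt.show()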

You can generate multi-dimensional clusters with this. X holds the data points, and y indicates which cluster the corresponding point in X belongs to.

This might be more than you need in this case, but in general I think it's better to rely on general-purpose, well-tested library code that can be reused in other cases as well.
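
To connect this with the function signature in the question, make_blobs can be wrapped so that it takes the two centers and the cluster size M. This is only a sketch: the wrapper name gen_data and the std and seed parameters are assumptions for illustration, and passing a list to n_samples assumes a scikit-learn version that accepts per-cluster sample counts (0.20 or later, as far as I know).

```python
from sklearn.datasets import make_blobs

def gen_data(co1, co2, M, std=0.2, seed=None):
    """Hypothetical wrapper: M points around each of the two centers."""
    # n_samples=[M, M] requests exactly M points per center;
    # the returned labels are 0 for co1 and 1 for co2, already shuffled with X
    X, t = make_blobs(
        n_samples=[M, M],
        centers=[co1, co2],
        cluster_std=std,
        random_state=seed,
    )
    return X, t

X, t = gen_data((0.5, -0.5), (-0.5, 0.5), M=1000, seed=1)
print(X.shape, t.shape)
```

Since the labels are 0 and 1, they can be used directly as indices into a two-element color array when plotting.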

edited Aug 29, 2023 at 12:24

Ganesh Tata


answered Jan 18, 2019 at 6:03

Farzad Vertigo



1 Comment

Jaeyoung Chun Over a year ago

Works great. The only minor thing is that samples_generator is now deprecated; use from sklearn.datasets import make_blobs instead.

HojjatK · Accepted Answer · 2019-01-18 06:53:28Z

3

You can use something like the following code:

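import numpy as np
import matplotlib.pyplot as plt

center1 = (50, 60)
center2 = (80, 20)
distance = 20

# Cluster 1: x is uniform on [center1[0], center1[0] + distance), y is normal around center1[1]
x1 = np.random.uniform(center1[0], center1[0] + distance, size=(100,))
y1 = np.random.normal(center1[1], distance, size=(100,))

# Cluster 2: same scheme around center2
x2 = np.random.uniform(center2[0], center2[0] + distance, size=(100,))
y2 = np.random.normal(center2[1], distance, size=(100,))

plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()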
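
One detail of the snippet above is that x is drawn uniformly from an interval that starts at the center, so the points are not centered on center1[0] in the x direction. The following check is not part of the answer. It is a small sketch comparing that draw with a normal draw for x, where the seed, the sample size n, and reusing distance as the standard deviation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
center1 = (50, 60)
distance = 20
n = 100_000  # large sample so the means are stable

# Same x draw as in the answer: uniform on [center, center + distance)
x_uniform = rng.uniform(center1[0], center1[0] + distance, size=n)

# Alternative (an assumption, not from the answer): normal x around the center
x_normal = rng.normal(center1[0], distance, size=n)

print(x_uniform.mean())  # close to center1[0] + distance / 2
print(x_normal.mean())   # close to center1[0]
```

If the goal is a Gaussian cluster centered on the given point, as in the question, drawing both coordinates from a normal distribution matches that more closely.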

edited Jan 18, 2019 at 6:53

answered Jan 18, 2019 at 4:57

HojjatK
