6

I'm working on a Python function in which I want to model a Gaussian distribution, but I'm stuck.

import numpy.random as rnd
import numpy as np

def genData(co1, co2, M):
  X = rnd.randn(2, 2M + 1)
  t = rnd.randn(1, 2M + 1)
  numpy.concatenate(X, co1)
  numpy.concatenate(X, co2)
  return(X, t)

I'm trying to generate two clusters of size M: cluster 1 centered at co1 and cluster 2 centered at co2. X would hold the data points I'm going to graph, and t the target values (1 if cluster 1, 2 if cluster 2) so I can color each point by cluster.

In that case, t is an array of 2M values (1s and 2s) and X holds 2M points, where t[i] is 1 if X[i] is in cluster 1 and 2 if X[i] is in cluster 2.

I figured the best way to start is to generate the array using NumPy's random module. What I'm confused about is how to center the points on each cluster.


Would the best way be to generate a cluster of size M, then add co1 to each of the points? How would I make it random, though, and make sure t[i] is set correctly so each point gets the right color?

I'm using this function to graph the data:

def graphData():
    co1 = (0.5, -0.5)
    co2 = (-0.5, 0.5)
    M = 1000
    X, t = genData(co1, co2, M)
    colors = np.array(['r', 'b'])
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], color = colors[t], s = 10)
  • Use numpy.random.multivariate_normal. Give the mean argument as a vector of length 2; that will be the location of the cluster. Commented Nov 4, 2017 at 20:32
  • @WarrenWeckesser Thanks Warren, but how will I make it so X is random and t will tell me which cluster it belongs to? Commented Nov 4, 2017 at 20:58
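
A minimal sketch of what the comment above suggests, assuming an identity covariance matrix (the question does not specify the spread) and a hypothetical `seed` parameter added only for reproducibility. Each cluster is drawn around its own mean, the label array is built alongside it, and both are shuffled with the same permutation so that t[i] still describes X[i]:

```python
import numpy as np

def gen_data(co1, co2, M, seed=None):
    """Return (X, t): 2M points in 2-D and their cluster labels (1 or 2)."""
    rng = np.random.default_rng(seed)
    cov = np.eye(2)  # assumption: unit variance, no correlation

    # Draw M points around each center.
    X1 = rng.multivariate_normal(co1, cov, size=M)
    X2 = rng.multivariate_normal(co2, cov, size=M)

    X = np.vstack([X1, X2])                       # shape (2M, 2)
    t = np.concatenate([np.ones(M, dtype=int),    # label 1 for cluster 1
                        np.full(M, 2, dtype=int)])  # label 2 for cluster 2

    # Shuffle points and labels with the same permutation so they stay paired.
    perm = rng.permutation(2 * M)
    return X[perm], t[perm]

X, t = gen_data((0.5, -0.5), (-0.5, 0.5), M=1000, seed=0)
```

Because the labels here are 1 and 2, a two-element color array would need to be indexed with `colors[t - 1]` rather than `colors[t]`.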

2 Answers

9

For your purpose, I would go for sklearn sample generator make_blobs:

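import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# One center and one standard deviation per cluster
centers = [(-5, -5), (5, 5)]
cluster_std = [0.8, 1]

# X has shape (n_samples, 2); y holds the cluster index (0 or 1) of each row
X, y = make_blobs(n_samples=100, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

plt.scatter(X[y == 0, 0], X[y == 0, 1], color="red", s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="blue", s=10, label="Cluster2")
plt.legend()
plt.show()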

You can generate multi-dimensional clusters with this. X holds the data points, and y gives the cluster that the corresponding point in X belongs to.
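
As a rough sketch, the same call can be adapted to the centers and cluster size from the question. The `cluster_std=0.3` value is an assumption chosen for illustration, since the question does not state a spread. With an integer `n_samples` and two centers, the samples are split equally between the clusters:

```python
import numpy as np
from sklearn.datasets import make_blobs

M = 1000
centers = [(0.5, -0.5), (-0.5, 0.5)]  # co1 and co2 from the question

# 2 * M samples in total, i.e. M per cluster; cluster_std is an assumed value.
X, y = make_blobs(n_samples=2 * M, centers=centers, cluster_std=0.3,
                  n_features=2, random_state=1)

# y is 0-based (0 for the first center, 1 for the second),
# so it can index a two-element color array directly.
colors = np.array(['r', 'b'])
point_colors = colors[y]
```

Passing `point_colors` as the `color` argument of `plt.scatter(X[:, 0], X[:, 1], ...)` then colors every point by its cluster in a single call.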


This might be more than you need in this case, but in general I think it's better to rely on general-purpose, well-tested library code that can be reused in other cases as well.


1 Comment

Works great. The only minor thing is samples_generator is now deprecated. Should use from sklearn.datasets import make_blobs instead.
3

You can use something like the following code:

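import numpy as np
import matplotlib.pyplot as plt

center1 = (50, 60)
center2 = (80, 20)
distance = 20

# x is uniform on [center, center + distance]; y is normal around the center
x1 = np.random.uniform(center1[0], center1[0] + distance, size=(100,))
y1 = np.random.normal(center1[1], distance, size=(100,))

x2 = np.random.uniform(center2[0], center2[0] + distance, size=(100,))
y2 = np.random.normal(center2[1], distance, size=(100,))

# Each scatter call gets its own default color, one per cluster
plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()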
