I have two distance matrices, each 232*232 where the column and row labels are identical. So this would be an abridged version of the two where A, B, C and D are the names of the points between which the distances are measured:
      A  B  C  D  ...          A  B  C  D  ...
  A   0  1  5  3           A   0  5  3  9
  B   4  0  4  1           B   2  0  7  8
  C   2  6  0  3           C   2  6  0  1
  D   2  7  1  0           D   5  2  5  0
  ...                      ...
The two matrices therefore represent the distances between pairs of points in two different networks. I want to identify clusters of pairs that are close together in one network and far apart in the other. I attempted to do this by first scaling the distances in each matrix, dividing every distance by the largest distance in that matrix. I then subtracted one matrix from the other and applied a clustering algorithm to the resulting matrix. The algorithm I was advised to use for this was k-means. The hope was that I could identify clusters of positive numbers that would correspond to pairs that were very close in matrix one and far apart in matrix two, and vice versa for clusters of negative numbers.
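To make the preprocessing concrete, here is a minimal sketch of the scaling and subtraction, using the abridged 4x4 example above rather than my real 232x232 data. The variable names are made up for illustration, and the direction of the subtraction (matrix two minus matrix one) is an assumption chosen so that positive values mean "close in matrix one, far apart in matrix two".

```python
import numpy as np

# Toy matrices copied from the abridged example (points A, B, C, D).
matrix_one = np.array([[0, 1, 5, 3],
                       [4, 0, 4, 1],
                       [2, 6, 0, 3],
                       [2, 7, 1, 0]], dtype=float)
matrix_two = np.array([[0, 5, 3, 9],
                       [2, 0, 7, 8],
                       [2, 6, 0, 1],
                       [5, 2, 5, 0]], dtype=float)

# Scale each matrix by its own largest distance so both lie in [0, 1].
scaled_one = matrix_one / matrix_one.max()
scaled_two = matrix_two / matrix_two.max()

# Assumed direction: positive = closer in matrix one than in matrix two.
difference = scaled_two - scaled_one
print(difference)
```

In this sketch, entry [i, j] of difference refers to the pair (row label i, column label j), so the row and column indices are what I would need to get back from the clustering.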
Firstly, I've read quite a bit about how to implement k-means in Python, and I'm aware that there are multiple modules that can be used. I've tried all three of these:
1.
import sklearn.cluster
import numpy as np
data = np.load('difference_matrix_file.npy')  # loads difference matrix from file
a = np.array(data)  # plain 2-D array copy of the matrix
clust_centers = 3
model = sklearn.cluster.k_means(a, clust_centers)
print(model)
2.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
difference_matrix = np.load('difference_matrix_file.npy')  # loads difference matrix from file
data = pd.DataFrame(difference_matrix)
model = KMeans(n_clusters=3)
print(model.fit(data))
3.
import sys
import numpy as np
from scipy.cluster.vq import vq, kmeans, whiten
np.set_printoptions(threshold=sys.maxsize)  # print arrays in full (newer NumPy rejects np.nan here)
difference_matrix = np.load('difference_matrix_file.npy')  # loads difference matrix from file
whitened = whiten(difference_matrix)
centroids = kmeans(whitened, 3)
print(centroids)
What I'm struggling with is how to interpret the output from these scripts. (I might add at this point that I'm neither a mathematician nor a computer scientist if the reader hadn't already guessed). I was expecting the output of the algorithm to be lists of coordinates of clustered pairs, one for each cluster so three in this case, that I could then trace back to my two original matrices and identify the names of the pairs of interest.
However, what I get is an array containing a list of numbers (one for each cluster), and I don't really understand what these numbers are. They don't obviously correspond to what I had in my input matrix, other than the fact that there are 232 items in each list, which is the same as the number of rows and columns in the input matrix. The last item in the array is another single number, which I presume must be the centroid of the clusters, but there isn't one for each cluster, just one for the whole array.
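To show the shape of that output without my real file, here is a small sketch of the third script run on random numbers. The random matrix and the names fake_difference, first and last are stand-ins made up for illustration, not my data. It only prints shapes, which look like what I describe above.

```python
import numpy as np
from scipy.cluster.vq import kmeans, whiten

np.random.seed(0)  # make the random stand-in data repeatable

# Stand-in for the real 232x232 difference matrix (values in [-1, 1]).
fake_difference = np.random.uniform(-1.0, 1.0, size=(232, 232))

result = kmeans(whiten(fake_difference), 3)

# The result has two parts: an array of lists, then a single number.
first, last = result
print(len(result))    # number of parts in the result
print(first.shape)    # one row per cluster found, 232 numbers per row
print(np.ndim(last))  # 0 means a single number, not a list
```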
I've been trying to figure this out for quite a while now but I'm struggling to get anywhere. Whenever I search for interpreting the output of kmeans I just get explanations of how to plot my clusters on a graph which isn't what I want to do. Please can someone explain to me what I'm seeing in my output and how I can get from this to the coordinates of the items in each cluster?