
Hi all, I have a correlation matrix of 21 industry sectors. Now I want to split these 21 sectors into 4 or 5 groups, with sectors of similar behavior grouped together.

Can experts shed some light on how to do this in Python, please? Thanks much in advance!

  • you can use machine learning clustering methods. Commented Oct 12, 2018 at 21:54
  • take a look at scipy. Commented Oct 12, 2018 at 21:56
  • sklearn has plenty of clustering algorithms, this forum is more aimed at specific coding problems than general "How do I" questions Commented Oct 12, 2018 at 21:59
  • Thanks much, seralouk and Hielka. Could either of you give me a simple example on how to get started pls? I'm not good enough at Python yet. Commented Oct 12, 2018 at 22:00
  • Got you, Anderson. I will take a look at your link. Thanks! Commented Oct 12, 2018 at 22:03

3 Answers


UPDATE: This answer is wrong, and your clustering will not work correctly. Do not use it and read the explanation in Martijn Courteaux's answer below.


You might explore the use of pandas DataFrame.corr and the scipy.cluster.hierarchy hierarchical clustering package:

import pandas as pd
import scipy.cluster.hierarchy as spc

df = pd.DataFrame(my_data)
corr = df.corr().values              # correlation matrix as a NumPy array

pdist = spc.distance.pdist(corr)                            # pairwise distances between rows of corr
linkage = spc.linkage(pdist, method='complete')             # complete-linkage hierarchical clustering
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')  # flat cluster labels
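A commenter asks what to do with idx once you have it: it is an array of flat cluster labels, one per column of the original DataFrame, so you can zip it with the column names to list the groups. A minimal self-contained sketch (the toy DataFrame standing in for my_data is invented here for illustration; the clustering lines reuse the recipe above as-is):

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as spc

# Toy stand-in for my_data: six series forming two correlated blocks
rng = np.random.default_rng(0)
base1, base2 = rng.normal(size=200), rng.normal(size=200)
df = pd.DataFrame({
    "a": base1 + 0.1 * rng.normal(size=200),
    "b": base1 + 0.1 * rng.normal(size=200),
    "c": base1 + 0.1 * rng.normal(size=200),
    "x": base2 + 0.1 * rng.normal(size=200),
    "y": base2 + 0.1 * rng.normal(size=200),
    "z": base2 + 0.1 * rng.normal(size=200),
})

corr = df.corr().values
pdist = spc.distance.pdist(corr)
linkage = spc.linkage(pdist, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')

# idx[i] is the cluster label of the i-th column; group the names by label
groups = {}
for col, label in zip(df.columns, idx):
    groups.setdefault(label, []).append(col)
print(groups)
```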

10 Comments

Here is a link to an example use of scipy and pandas that may be of interest: github.com/TheLoneNut/CorrelationMatrixClustering/blob/master/…
What do I do with idx once I've obtained it?
Is this right? Surely if a correlation is 0, then the pairwise distance is 0, which is the opposite of what we want?
I don't understand the logic behind pdist(corr). Shouldn't 1-corr be the distance, not Euclidean distance between two rows?
Can you explain the 0.5 * pdist.max() please?

Okay, @Wes' answer suggested some good functions for the task, but used them incorrectly. After some more reading of the documentation, it seems you need a condensed pairwise distance matrix before passing it to the spc.linkage function: the upper-triangular part of the distance matrix, flattened row by row.

The documentation also says that spc.distance.pdist returns a distance matrix in that condensed form. However, its input is NOT a correlation matrix or anything like that: it expects raw observations and computes the distance matrix itself using the specified metric.

Now, it will come as no surprise to you that a covariance or correlation matrix already summarizes observations into a matrix. But instead of representing a distance, it represents correlation. Here is where I am unsure what is mathematically the most sound thing to do, but I believe you can turn this correlation matrix into a distance matrix of sorts by simply computing 1.0 - corr.

So let's do that:

import numpy as np
import scipy.cluster.hierarchy as spc

# Turn the correlation matrix into a dissimilarity matrix
pdist_uncondensed = 1.0 - corr
# Flatten the upper triangle, row by row, into the condensed form
pdist_condensed = np.concatenate([row[i+1:] for i, row in enumerate(pdist_uncondensed)])
linkage = spc.linkage(pdist_condensed, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist_condensed.max(), 'distance')
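Since the OP wants 4 or 5 groups, fcluster's criterion='maxclust' lets you request a cluster count directly instead of tuning a distance threshold. A sketch using the same 1 - corr approach; the 21x21 matrix here is synthetic (five planted blocks of co-moving "sectors") purely so the example runs:

```python
import numpy as np
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import squareform

# Synthetic stand-in for the OP's 21x21 sector correlation matrix
rng = np.random.default_rng(42)
sizes = [5, 5, 4, 4, 3]                                 # 5 groups, 21 "sectors" total
labels_true = np.repeat(np.arange(len(sizes)), sizes)
series = []
for n in sizes:
    base = rng.normal(size=250)                         # shared driver for this block
    for _ in range(n):
        series.append(base + 0.3 * rng.normal(size=250))
corr = np.corrcoef(series)                              # 21 x 21 correlation matrix

dist = 1.0 - corr                                       # correlation -> dissimilarity
dist_condensed = squareform(dist, checks=False)         # condensed upper-triangular form
linkage = spc.linkage(dist_condensed, method='complete')
idx = spc.fcluster(linkage, t=5, criterion='maxclust')  # ask for exactly 5 groups
print(idx)
```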

4 Comments

pdist_condensed = pdist_uncondensed[np.triu_indices_from(pdist_uncondensed, k=1)] is a shorter way to get the values of the upper triangular, using numpy
1.0 - corr is in the range 0..2, so one might set the threshold of spc.fcluster to t=1 instead of basing it on the observed data (0.5 * max); alternatively one could use the median?
It would make sense to adjust the threshold, especially to a value, as asked by OP, to get "4 or 5 groups".
scipy.spatial.distance.squareform is an even easier way to get the condensed distance matrix from the upper-triangular data (and vice versa)
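To illustrate the squareform suggestion from the comments: it converts a square distance matrix to the condensed upper-triangular form and back. A small sketch:

```python
import numpy as np
from scipy.spatial.distance import squareform

square = np.array([[0.0, 0.3, 0.8],
                   [0.3, 0.0, 0.5],
                   [0.8, 0.5, 0.0]])

condensed = squareform(square)   # upper triangle, row by row
print(condensed)                 # [0.3 0.8 0.5]

back = squareform(condensed)     # and back to the square matrix
assert np.allclose(back, square)

# Equivalent to the triu_indices_from one-liner from the first comment:
assert np.allclose(condensed, square[np.triu_indices_from(square, k=1)])
```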

Clustering algorithms can use any distance metric (e.g. a similarity metric S, or the dissimilarity 1 - S). A distance metric follows from a norm definition: for example, Euclidean distance is measured with the L2 norm (Euclidean norm), Mahalanobis distance is a weighted Euclidean distance, and the L1 norm gives Manhattan distance. There are many other similarity metrics, as in the wiki link given.

You should understand vector norms to calculate distances; see, e.g., Vector and Matrix Norms with NumPy Linalg Norm, or here:

Vector norm is the magnitude (or length) of the vector

Your correlation index (Pearson correlation, I assume) defines a linear relationship between two vectors, X ~ Y, whereas a clustering algorithm uses distances between points in space, X1 ~ X2. In this simple two-feature example you can still treat X1 as X and X2 as Y, but for multidimensional data you use pairwise distances in space, the distance between a point and a vector, or the distance between two vectors, norm(x - y): any of these are in the scipy docs.

Pay attention to the topic of norms, because they are meaningful for distance measurements in space:

All norms can be used to create a distance function
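To make that point concrete, each norm induces a distance d(x, y) = norm(x - y); a small sketch with numpy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

d_l2 = np.linalg.norm(x - y)            # Euclidean (L2) distance
d_l1 = np.linalg.norm(x - y, 1)         # Manhattan (L1) distance
d_inf = np.linalg.norm(x - y, np.inf)   # Chebyshev (L-infinity) distance

print(d_l2, d_l1, d_inf)  # 5.0 7.0 4.0
```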

P.S. To answer the question in the comments on this answer:

I'd like to repeat @tillKadabra's question: why 0.5 * pdist.max()?

-- you'd better see the topic "how-does-condensed-distance-matrix-work-pdist" and the docs for fcluster:

this is the threshold to apply when forming flat clusters.

-- I think such a threshold will divide the samples into 2 clusters, for example, perhaps

