
Hi all, I have a correlation matrix of 21 industry sectors. Now I want to split these 21 sectors into 4 or 5 groups, with sectors of similar behavior grouped together.

Can experts shed some light on how to do this in Python, please? Thanks much in advance!

  • you can use machine learning clustering methods. Commented Oct 12, 2018 at 21:54
  • take a look at scipy. Commented Oct 12, 2018 at 21:56
  • sklearn has plenty of clustering algorithms, this forum is more aimed at specific coding problems than general "How do I" questions Commented Oct 12, 2018 at 21:59
  • Thanks much, seralouk and Hielka. Could either of you give me a simple example on how to get started pls? I'm not good enough at Python yet. Commented Oct 12, 2018 at 22:00
  • Got you, Anderson. I will take a look at your link. Thanks! Commented Oct 12, 2018 at 22:03

3 Answers


UPDATE: This answer is wrong, and your clustering will not work correctly. Do not use it and read the explanation in Martijn Courteaux's answer below.


You might explore the use of pandas DataFrame.corr and the scipy.cluster.hierarchy hierarchical clustering package:

import pandas as pd
import scipy.cluster.hierarchy as spc

df = pd.DataFrame(my_data)
corr = df.corr().values              # correlation matrix as a NumPy array

pdist = spc.distance.pdist(corr)                            # pairwise distances between rows of corr
linkage = spc.linkage(pdist, method='complete')             # complete-linkage hierarchical clustering
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')  # flat cluster labels
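A commenter asks what to do with idx once you have it: it is an array of flat cluster labels, one per column of the original DataFrame, so you can zip it with the column names to list the groups. A minimal self-contained sketch (the toy DataFrame standing in for my_data is invented here for illustration; the clustering lines reuse the recipe above as-is):

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as spc

# Toy stand-in for my_data: six series forming two correlated blocks
rng = np.random.default_rng(0)
base1, base2 = rng.normal(size=200), rng.normal(size=200)
df = pd.DataFrame({
    "a": base1 + 0.1 * rng.normal(size=200),
    "b": base1 + 0.1 * rng.normal(size=200),
    "c": base1 + 0.1 * rng.normal(size=200),
    "x": base2 + 0.1 * rng.normal(size=200),
    "y": base2 + 0.1 * rng.normal(size=200),
    "z": base2 + 0.1 * rng.normal(size=200),
})

corr = df.corr().values
pdist = spc.distance.pdist(corr)
linkage = spc.linkage(pdist, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')

# idx[i] is the cluster label of the i-th column; group the names by label
groups = {}
for col, label in zip(df.columns, idx):
    groups.setdefault(label, []).append(col)
print(groups)
```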

10 Comments

Here is a link to an example use of scipy and pandas that may be of interest: github.com/TheLoneNut/CorrelationMatrixClustering/blob/master/…
What do I do with idx once I've obtained it?
Is this right? Surely if a correlation is 0, then the pairwise distance is 0, which is the opposite of what we want?
I don't understand the logic behind pdist(corr). Shouldn't 1-corr be the distance, not Euclidean distance between two rows?
Can you explain the 0.5 * pdist.max() please?

Okay, @Wes' answer suggested some good functions for the task, but used them incorrectly. After some more reading of the documentation, it seems you need a condensed pairwise distance matrix before passing it to the spc.linkage function: the upper-triangular part of the distance matrix, flattened row by row.

The documentation also says that spc.distance.pdist returns a distance matrix in that condensed form. However, its input is NOT a correlation matrix or anything like that: it expects raw observations and computes the distance matrix itself using the specified metric.

Now, it will come as no surprise to you that a covariance or correlation matrix already summarizes observations into a matrix. But instead of representing a distance, it represents correlation. Here is where I am unsure what is mathematically the most sound thing to do, but I believe you can turn this correlation matrix into a distance matrix of sorts by simply computing 1.0 - corr.

So let's do that:

import numpy as np
import scipy.cluster.hierarchy as spc

# Turn the correlation matrix into a dissimilarity matrix
pdist_uncondensed = 1.0 - corr
# Flatten the upper triangle, row by row, into the condensed form
pdist_condensed = np.concatenate([row[i+1:] for i, row in enumerate(pdist_uncondensed)])
linkage = spc.linkage(pdist_condensed, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist_condensed.max(), 'distance')
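Since the OP wants 4 or 5 groups, fcluster's criterion='maxclust' lets you request a cluster count directly instead of tuning a distance threshold. A sketch using the same 1 - corr approach; the 21x21 matrix here is synthetic (five planted blocks of co-moving "sectors") purely so the example runs:

```python
import numpy as np
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import squareform

# Synthetic stand-in for the OP's 21x21 sector correlation matrix
rng = np.random.default_rng(42)
sizes = [5, 5, 4, 4, 3]                                 # 5 groups, 21 "sectors" total
labels_true = np.repeat(np.arange(len(sizes)), sizes)
series = []
for n in sizes:
    base = rng.normal(size=250)                         # shared driver for this block
    for _ in range(n):
        series.append(base + 0.3 * rng.normal(size=250))
corr = np.corrcoef(series)                              # 21 x 21 correlation matrix

dist = 1.0 - corr                                       # correlation -> dissimilarity
dist_condensed = squareform(dist, checks=False)         # condensed upper-triangular form
linkage = spc.linkage(dist_condensed, method='complete')
idx = spc.fcluster(linkage, t=5, criterion='maxclust')  # ask for exactly 5 groups
print(idx)
```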

4 Comments

pdist_condensed = pdist_uncondensed[np.triu_indices_from(pdist_uncondensed, k=1)] is a shorter way to get the values of the upper triangular, using numpy
1.0 - corr is in the range 0..2, so one might set the threshold of spc.fcluster to t=1 instead of basing it on the observed data (0.5 * max); alternatively one could use the median?
It would make sense to adjust the threshold, especially to a value, as asked by OP, to get "4 or 5 groups".
scipy.spatial.distance.squareform is an even easier way to get the condensed distance matrix from the upper-triangular data (and vice versa)
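To illustrate the squareform suggestion from the comments: it converts a square distance matrix to the condensed upper-triangular form and back. A small sketch:

```python
import numpy as np
from scipy.spatial.distance import squareform

square = np.array([[0.0, 0.3, 0.8],
                   [0.3, 0.0, 0.5],
                   [0.8, 0.5, 0.0]])

condensed = squareform(square)   # upper triangle, row by row
print(condensed)                 # [0.3 0.8 0.5]

back = squareform(condensed)     # and back to the square matrix
assert np.allclose(back, square)

# Equivalent to the triu_indices_from one-liner from the first comment:
assert np.allclose(condensed, square[np.triu_indices_from(square, k=1)])
```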

Clustering algorithms can use any distance metric (e.g. a similarity metric S, or the dissimilarity 1 - S). A distance metric follows from a norm definition: for example, Euclidean distance is measured with the L2 norm (Euclidean norm), Mahalanobis distance is a weighted Euclidean distance, and the L1 norm gives Manhattan distance. There are many other similarity metrics, as in the wiki link given.

You should understand vector norms to calculate distances; see, e.g., Vector and Matrix Norms with NumPy Linalg Norm, or here:

Vector norm is the magnitude (or length) of the vector

Your correlation index (Pearson correlation, I assume) defines a linear relationship between two vectors, X ~ Y, whereas a clustering algorithm uses distances between points in space, X1 ~ X2. In this simple two-feature example you can still treat X1 as X and X2 as Y, but for multidimensional data you use pairwise distances in space, the distance between a point and a vector, or the distance between two vectors, norm(x - y): any of these are in the scipy docs.

Pay attention to the topic of norms, because they are meaningful for distance measurements in space:

All norms can be used to create a distance function
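To make that point concrete, each norm induces a distance d(x, y) = norm(x - y); a small sketch with numpy:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

d_l2 = np.linalg.norm(x - y)            # Euclidean (L2) distance
d_l1 = np.linalg.norm(x - y, 1)         # Manhattan (L1) distance
d_inf = np.linalg.norm(x - y, np.inf)   # Chebyshev (L-infinity) distance

print(d_l2, d_l1, d_inf)  # 5.0 7.0 4.0
```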

P.S. To answer the question in the comments on this answer:

I'd like to repeat @tillKadabra's question: why 0.5 * pdist.max()?

-- you'd better see the topic "how-does-condensed-distance-matrix-work-pdist" and the docs for fcluster:

this is the threshold to apply when forming flat clusters.

-- I think such a threshold will divide the samples into 2 clusters, for example, perhaps

