I have a DataFrame with IDs which are members of different groups (each column is a separate group) - full data.

df
      ID  G_01  G_02  G_03  G_04
0   A_01   1.0   NaN   NaN   NaN
1   A_02   1.0   NaN   NaN   NaN
2   A_03   1.0   1.0   NaN   NaN
3   A_04   NaN   1.0   NaN   NaN
4   A_05   NaN   1.0   NaN   NaN
5   A_06   NaN   NaN   NaN   1.0
6   A_07   NaN   NaN   1.0   1.0
7   A_08   NaN   NaN   1.0   NaN
8   A_09   NaN   NaN   1.0   NaN
9   A_10   NaN   NaN   1.0   NaN
10  A_11   NaN   NaN   1.0   1.0
11  A_12   NaN   NaN   NaN   1.0

As you can see, some IDs are members of more than one group. G_01 and G_02 share A_03, so they should be grouped as one cluster. G_03 and G_04 share A_07 and A_11 but share no ID with G_01 or G_02, so they should be grouped as cluster 2, like below:

      ID  G_01  G_02  G_03  G_04  Cluster
0   A_01   1.0   NaN   NaN   NaN        1
1   A_02   1.0   NaN   NaN   NaN        1
2   A_03   1.0   1.0   NaN   NaN        1
3   A_04   NaN   1.0   NaN   NaN        1
4   A_05   NaN   1.0   NaN   NaN        1
5   A_06   NaN   NaN   NaN   1.0        2
6   A_07   NaN   NaN   1.0   1.0        2
7   A_08   NaN   NaN   1.0   NaN        2
8   A_09   NaN   NaN   1.0   NaN        2
9   A_10   NaN   NaN   1.0   NaN        2
10  A_11   NaN   NaN   1.0   1.0        2
11  A_12   NaN   NaN   NaN   1.0        2

The number of IDs and Groups isn't constant and I don't know it in advance. Do you have any idea how to achieve this clustering?

EDIT

Order of the columns should not matter. If I change it to G_02, G_03, G_01, G_04 I'd like to receive the same result as with G_01, G_02, G_03, G_04.

data I am working on

1 Answer

This can be solved by finding the connected components in your data. One approach is to label them with scipy.ndimage.label (exposed in older SciPy versions as scipy.ndimage.measurements.label):

import numpy as np
from scipy import ndimage

# label the connected components in the membership matrix (NaN is treated as 0)
x_components, _ = ndimage.label(df.drop(columns='ID').fillna(0).to_numpy())
# each row contains a single non-zero label, so the row max is the cluster the row belongs to
df['cluster'] = x_components.max(axis=1)

print(df)

      ID  G_01  G_02  G_03  G_04  cluster
0   A_01   1.0   NaN   NaN   NaN        1
1   A_02   1.0   NaN   NaN   NaN        1
2   A_03   1.0   1.0   NaN   NaN        1
3   A_04   NaN   1.0   NaN   NaN        1
4   A_05   NaN   1.0   NaN   NaN        1
5   A_06   NaN   NaN   NaN   1.0        2
6   A_07   NaN   NaN   1.0   1.0        2
7   A_08   NaN   NaN   1.0   NaN        2
8   A_09   NaN   NaN   1.0   NaN        2
9   A_10   NaN   NaN   1.0   NaN        2
10  A_11   NaN   NaN   1.0   1.0        2
11  A_12   NaN   NaN   NaN   1.0        2

Here x_components holds the labeled components:

print(x_components)

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 2],
       [0, 0, 2, 2],
       [0, 0, 2, 0],
       [0, 0, 2, 0],
       [0, 0, 2, 0],
       [0, 0, 2, 2],
       [0, 0, 0, 2]])
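
One caveat: ndimage.label joins cells that are neighbours in the array, so its labels depend on how the rows and columns happen to be ordered. If the order should not matter, a graph-based variant may be safer. The following is only a sketch, not part of the approach above: it assumes a group is any non-ID column, that membership is marked by a non-NaN value, and the helper name cluster_by_shared_ids is made up for illustration. It links two groups whenever they share an ID and uses scipy.sparse.csgraph.connected_components to find the clusters.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def cluster_by_shared_ids(df, id_col="ID"):
    """Return a 1-based cluster label per row; 0 marks rows with no group.

    Hypothetical helper: every column except id_col is treated as a group.
    """
    # Boolean membership matrix: one row per ID, one column per group.
    member = df.drop(columns=id_col).notna().to_numpy()
    m = member.astype(int)
    # Group-by-group matrix: entry (i, j) counts the IDs shared by groups i and j.
    shared = csr_matrix(m.T @ m)
    # Groups in the same connected component are linked through shared IDs.
    _, group_labels = connected_components(shared, directed=False)
    # All groups of one ID lie in the same component, so its first group suffices.
    row_labels = group_labels[m.argmax(axis=1)]
    has_group = member.any(axis=1)
    clusters = pd.Series(0, index=df.index, name="cluster")
    # Renumber by first appearance in the rows so column order does not change labels.
    clusters.loc[has_group] = pd.factorize(row_labels[has_group])[0] + 1
    return clusters


# Rebuild the example from the question.
groups = {
    "G_01": ["A_01", "A_02", "A_03"],
    "G_02": ["A_03", "A_04", "A_05"],
    "G_03": ["A_07", "A_08", "A_09", "A_10", "A_11"],
    "G_04": ["A_06", "A_07", "A_11", "A_12"],
}
example = pd.DataFrame({"ID": [f"A_{i:02d}" for i in range(1, 13)]})
for name, members in groups.items():
    example[name] = np.where(example["ID"].isin(members), 1.0, np.nan)

example["cluster"] = cluster_by_shared_ids(example[["ID", *groups]])
print(example)
```

Because the clusters come from which groups share IDs rather than from positions in the array, reordering the columns to G_02, G_03, G_01, G_04 should give the same labels.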

7 Comments

Hi @yatu, almost there but this approach is creating separate unique clusters instead of ID-linked-together ones. With above df I get clusters 1,1,2,3,3,4,5,6,6,6,5,4
Hi @BartekNowakowski I'm getting the correct result using the dataframe you've shared. Are you sure it is the same df?
That was just an example, my original df has 623 IDs and 71 groups. I'll look closer into it and let you know.
It'd be helpful if you find some example where this fails to get what you want @BartekNowakowski
Can't figure this out... here's data I am using (IDs and Groups anonymized): 1drv.ms/x/s!AgH13E5f0n83g61L0cTwk_hoxscjLg In sheet 'result' column BV has clusters. Row 17 is already cluster 2 instead of 1.