Python: clustering DataFrame based on overlapping items

Question

I have a DataFrame with IDs which are members of different groups (each column is a separate group) - full data.

df
      ID  G_01  G_02  G_03  G_04
0   A_01   1.0   NaN   NaN   NaN
1   A_02   1.0   NaN   NaN   NaN
2   A_03   1.0   1.0   NaN   NaN
3   A_04   NaN   1.0   NaN   NaN
4   A_05   NaN   1.0   NaN   NaN
5   A_06   NaN   NaN   NaN   1.0
6   A_07   NaN   NaN   1.0   1.0
7   A_08   NaN   NaN   1.0   NaN
8   A_09   NaN   NaN   1.0   NaN
9   A_10   NaN   NaN   1.0   NaN
10  A_11   NaN   NaN   1.0   1.0
11  A_12   NaN   NaN   NaN   1.0
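
To make the example reproducible, the frame above can be rebuilt with a minimal sketch like the one below. The `members` dictionary and the variable names are illustrative; they are transcribed from the table and are not part of my real data.

```python
import numpy as np
import pandas as pd

# Group membership transcribed from the table above (illustrative names).
members = {
    "G_01": ["A_01", "A_02", "A_03"],
    "G_02": ["A_03", "A_04", "A_05"],
    "G_03": ["A_07", "A_08", "A_09", "A_10", "A_11"],
    "G_04": ["A_06", "A_07", "A_11", "A_12"],
}

ids = [f"A_{i:02d}" for i in range(1, 13)]
df = pd.DataFrame({"ID": ids})
for group, group_ids in members.items():
    # 1.0 where the ID belongs to the group, NaN otherwise
    df[group] = np.where(df["ID"].isin(group_ids), 1.0, np.nan)

print(df)
```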

As you can see, some IDs are members of more than one group. G_01 and G_02 should therefore be grouped as one cluster (they share A_03). G_03 and G_04 share A_07 and A_11 but share no ID with G_01 or G_02, so they should form cluster 2, like below:

      ID  G_01  G_02  G_03  G_04  Cluster
0   A_01   1.0   NaN   NaN   NaN        1
1   A_02   1.0   NaN   NaN   NaN        1
2   A_03   1.0   1.0   NaN   NaN        1
3   A_04   NaN   1.0   NaN   NaN        1
4   A_05   NaN   1.0   NaN   NaN        1
5   A_06   NaN   NaN   NaN   1.0        2
6   A_07   NaN   NaN   1.0   1.0        2
7   A_08   NaN   NaN   1.0   NaN        2
8   A_09   NaN   NaN   1.0   NaN        2
9   A_10   NaN   NaN   1.0   NaN        2
10  A_11   NaN   NaN   1.0   1.0        2
11  A_12   NaN   NaN   NaN   1.0        2

The number of IDs and Groups isn't constant and I don't know it in advance. Do you have any idea how to achieve this clustering?

EDIT

Order of the columns should not matter. If I change it to G_02, G_03, G_01, G_04 I'd like to receive the same result as with G_01, G_02, G_03, G_04.

data I am working on

yatu · Accepted Answer · 2020-01-31 10:11:59Z

This can be solved by looking for the connected components in your data. One approach is to use scipy.ndimage.label (formerly reachable as scipy.ndimage.measurements.label) to label them:

import numpy as np
from scipy import ndimage

# label the connected components in the membership matrix (NaN treated as 0)
x_components, _ = ndimage.label(df.drop(columns='ID').fillna(0))
# each row's cluster is the largest label found in that row
df['cluster'] = x_components.max(axis=1)

print(df)

      ID  G_01  G_02  G_03  G_04  cluster
0   A_01   1.0   NaN   NaN   NaN        1
1   A_02   1.0   NaN   NaN   NaN        1
2   A_03   1.0   1.0   NaN   NaN        1
3   A_04   NaN   1.0   NaN   NaN        1
4   A_05   NaN   1.0   NaN   NaN        1
5   A_06   NaN   NaN   NaN   1.0        2
6   A_07   NaN   NaN   1.0   1.0        2
7   A_08   NaN   NaN   1.0   NaN        2
8   A_09   NaN   NaN   1.0   NaN        2
9   A_10   NaN   NaN   1.0   NaN        2
10  A_11   NaN   NaN   1.0   1.0        2
11  A_12   NaN   NaN   NaN   1.0        2

Where the x_components are the labeled components:

print(x_components)

array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 2],
       [0, 0, 2, 2],
       [0, 0, 2, 0],
       [0, 0, 2, 0],
       [0, 0, 2, 0],
       [0, 0, 2, 2],
       [0, 0, 0, 2]])
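
Note that ndimage labelling connects cells only when they are neighbours in the array, so the labels can depend on how rows and columns happen to be ordered. As a hedged alternative (not the method shown above), the same connected-components idea can be applied to a graph in which two IDs are linked whenever they share a group, using scipy.sparse.csgraph.connected_components. The helper name `cluster_by_shared_ids` is hypothetical, and the sketch assumes every column other than `ID` is a group column.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def cluster_by_shared_ids(df, id_col="ID"):
    """Hypothetical helper: one cluster label per row, starting at 1."""
    # Membership matrix: rows are IDs, columns are groups (assumes all
    # non-ID columns are group columns)
    member = df.drop(columns=id_col).notna().to_numpy().astype(int)
    # Two IDs are adjacent when they share at least one group
    adjacency = csr_matrix(member @ member.T)
    _, labels = connected_components(adjacency, directed=False)
    # Renumber from 1 in order of first appearance
    return pd.factorize(labels)[0] + 1


# Example frame transcribed from the question
members = {
    "G_01": ["A_01", "A_02", "A_03"],
    "G_02": ["A_03", "A_04", "A_05"],
    "G_03": ["A_07", "A_08", "A_09", "A_10", "A_11"],
    "G_04": ["A_06", "A_07", "A_11", "A_12"],
}
df = pd.DataFrame({"ID": [f"A_{i:02d}" for i in range(1, 13)]})
for group, group_ids in members.items():
    df[group] = np.where(df["ID"].isin(group_ids), 1.0, np.nan)

clusters = cluster_by_shared_ids(df)

# Reordering the group columns should not change the labels
shuffled = df[["ID", "G_02", "G_03", "G_01", "G_04"]]
clusters_shuffled = cluster_by_shared_ids(shuffled)

df["cluster"] = clusters
print(df)
```

Because the adjacency is built from shared membership rather than from position in the array, the result does not depend on column order. An ID that belongs to no group ends up in a cluster of its own.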

edited Jan 31, 2020 at 10:11

answered Jan 31, 2020 at 9:45


7 Comments

Bartek Nowakowski Over a year ago

Hi @yatu, almost there but this approach is creating separate unique clusters instead of ID-linked-together ones. With above df I get clusters 1,1,2,3,3,4,5,6,6,6,5,4

yatu Over a year ago

Hi @BartekNowakowski I'm getting the correct result using the dataframe you've shared. Are you sure it is the same df?

Bartek Nowakowski Over a year ago

That was just an example, my original df has 623 IDs and 71 groups. I'll look closer into it and let you know.

yatu Over a year ago

It'd be helpful if you find some example where this fails to get what you want @BartekNowakowski

Bartek Nowakowski Over a year ago

Can't figure this out... here's data I am using (IDs and Groups anonymized): 1drv.ms/x/s!AgH13E5f0n83g61L0cTwk_hoxscjLg In sheet 'result' column BV has clusters. Row 17 is already cluster 2 instead of 1.
