KMeans Clustering using Python

Question

I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".

In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster.

Output Example:

Cluster     Names
AZ          A,B
AY,AY       C,E
AZ,AZ       F

PS. Actual dataset may vary in number of rows and columns How can I do it using ML based clustering algorithms like KNN, Naive Bayes, etc? I need two approaches, one without ignoring NaN, one ignoring NaN.

mozway · Accepted Answer · 2022-02-03 13:38:14Z

1

IIUC, this looks like a double groupby. You first need to group by Name and System and aggregate System to form the clusters. Then this is a simple groupby with aggregation as string.

(df.groupby(['Name', 'System'])
   ['System'].agg(Cluster=','.join)          # clusters of repeats
   .droplevel('System').reset_index()
   .groupby('Cluster')['Name'].agg(','.join) # aggregate by cluster
   .reset_index()
)

output:

  Cluster   Name
0   AY,AY    C,E
1      AZ  A,B,D
2   AZ,AZ      F

NB. I used aggregation with ','.join here but you could also use tuple or a custom function (first groupby) and frozenset (second groupby) to keep access to the individual elements

(df.groupby(['Name', 'System'], dropna=False)
   ['System'].agg(Cluster=lambda x: (x.iloc[0], len(x)))
   .droplevel('System').reset_index()
   .groupby('Cluster')['Name'].agg(frozenset)
   .reset_index()
)

output:

    Cluster       Name
0   (AY, 2)     (E, C)
1   (AZ, 1)  (B, A, D)
2   (AZ, 2)        (F)
3  (nan, 1)  (E, F, A)
4  (nan, 2)        (B)

edited Feb 3, 2022 at 13:38

answered Feb 3, 2022 at 13:25

mozway

267k13 gold badges56 silver badges106 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

spd Over a year ago

How can I do the same using KMeans Clustering?

mozway Over a year ago

I don't see how this relates to k-means clustering

spd Over a year ago

Something like Maybe mapping these letters to integers first and converting to array, then fit it to KMeans

mozway Over a year ago

I don't really see how this applies. Here this is a perfect aggregation of categories. k-means would rather be useful on continuous data. The question is how do you think k-means would improve anything here?

Collectives™ on Stack Overflow

KMeans Clustering using Python

1 Answer 1

4 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

4 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related