I have a dataset:

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".

In the above example, Names A, B and D each have one "AZ" entry, while C and E each have two "AY" entries. F has two "AZ" entries, so it forms a separate cluster.

Output Example:

Cluster     Names
AZ          A,B,D
AY,AY       C,E
AZ,AZ       F

PS. The actual dataset may vary in the number of rows and columns. How can I do this using ML-based clustering algorithms like KNN, Naive Bayes, etc.? I need two approaches: one that keeps NaN and one that ignores NaN.

1 Answer

IIUC, this looks like a double groupby. First group by Name and System and join the System values, which gives each name its cluster label. Then group by that cluster label and join the names into a string.

(df.groupby(['Name', 'System'])
   ['System'].agg(Cluster=','.join)          # clusters of repeats
   .droplevel('System').reset_index()
   .groupby('Cluster')['Name'].agg(','.join) # aggregate by cluster
   .reset_index()
)

output:

  Cluster   Name
0   AY,AY    C,E
1      AZ  A,B,D
2   AZ,AZ      F
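For anyone who wants to reproduce this, here is a minimal self-contained sketch of the same two steps. The DataFrame is rebuilt by hand from the question's table, and the intermediate names `per_name` and `clusters` are only illustrative. Since `groupby` drops NaN keys by default, this version corresponds to the "ignoring NaN" case.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question's table.
df = pd.DataFrame({
    "Name": list("AABBBCCDEEEFFF"),
    "System": ["AZ", np.nan, "AZ", np.nan, np.nan, "AY", "AY", "AZ",
               "AY", "AY", np.nan, "AZ", "AZ", np.nan],
})

# Step 1: one row per (Name, System) pair, with the repeated System
# values joined into a label such as "AY,AY". NaN keys are dropped.
per_name = (
    df.groupby(["Name", "System"])["System"]
      .agg(Cluster=",".join)
      .droplevel("System")
      .reset_index()
)

# Step 2: collect the names that share the same cluster label.
clusters = per_name.groupby("Cluster")["Name"].agg(",".join).reset_index()
print(clusters)
```

Splitting the chain into two named steps makes it easier to inspect `per_name` before the final aggregation.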

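NB. I used aggregation with ','.join here, but you could also use tuple or a custom function (first groupby) and frozenset (second groupby) to keep access to the individual elements. The example below also passes dropna=False to the first groupby so that NaN is kept as its own group instead of being dropped, which is the default: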

(df.groupby(['Name', 'System'], dropna=False)
   ['System'].agg(Cluster=lambda x: (x.iloc[0], len(x)))
   .droplevel('System').reset_index()
   .groupby('Cluster')['Name'].agg(frozenset)
   .reset_index()
)

output:

    Cluster       Name
0   (AY, 2)     (E, C)
1   (AZ, 1)  (B, A, D)
2   (AZ, 2)        (F)
3  (nan, 1)  (E, F, A)
4  (nan, 2)        (B)
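Since the question asks for both a NaN-keeping and a NaN-ignoring variant, the two can be wrapped in one helper. This is only a sketch under some assumptions: `cluster_by_repeats` and its `keep_nan` flag are hypothetical names, and it uses `size()` plus separate `System` and `Count` columns rather than the tuple-building lambda above, so the cluster key is spread over two columns.

```python
import numpy as np
import pandas as pd

def cluster_by_repeats(df, keep_nan):
    """Group names by (System, repeat count); keep_nan toggles NaN handling."""
    # Count how often each System value occurs for each Name.
    counts = (
        df.groupby(["Name", "System"], dropna=not keep_nan)
          .size()
          .reset_index(name="Count")
    )
    # Names with the same (System, Count) pair form one cluster.
    return (
        counts.groupby(["System", "Count"], dropna=not keep_nan)["Name"]
              .agg(frozenset)
              .reset_index()
    )

# Sample data retyped from the question's table.
df = pd.DataFrame({
    "Name": list("AABBBCCDEEEFFF"),
    "System": ["AZ", np.nan, "AZ", np.nan, np.nan, "AY", "AY", "AZ",
               "AY", "AY", np.nan, "AZ", "AZ", np.nan],
})

with_nan = cluster_by_repeats(df, keep_nan=True)      # NaN is its own System
without_nan = cluster_by_repeats(df, keep_nan=False)  # NaN rows are ignored
print(with_nan)
print(without_nan)
```

Keeping `Name` as a frozenset means membership checks such as `"A" in row.Name` work directly, without splitting a joined string.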

4 Comments

How can I do the same using KMeans Clustering?
I don't see how this relates to k-means clustering
Maybe something like mapping these letters to integers first, converting them to an array, and then fitting KMeans on it.
I don't really see how this applies. What you have here is an exact aggregation of categories, whereas k-means is better suited to continuous data. The question is: how do you think k-means would improve anything here?
