
I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the names based on how many times each "System" value is repeated for a particular "Name".

In the above example, names A, B and D each have one "AZ", C and E each have two "AY", and F has two "AZ", so F forms a separate cluster. NaN values can be ignored.

Output Example:

Cluster     Names
AZ          A,B,D
AY,AY       C,E
AZ,AZ       F

How can I do it using Python?

PS. The actual dataset may vary in the number of rows and columns. Also, how can I do it using ML-based classification algorithms like KNN, Naive Bayes, etc.?

  • In the above question, how can I form clusters without ignoring NaN values? Commented Feb 2, 2022 at 17:22

2 Answers


Use groupby + agg twice; once to join "Systems" and then to join "Names":

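# Drop rows with a missing System (only that column, so extra columns are safe),
# then join each Name's systems into one string
s = df.dropna(subset=['System']).groupby('Name')['System'].agg(', '.join)
# Swap index and values so the joined string becomes the grouping key
s = pd.Series(s.index, index=s)
out = s.groupby(level=0).agg(', '.join).reset_index().rename(columns={'System':'Cluster'})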

Output:

  Cluster     Name
0  AY, AY     C, E
1      AZ  A, B, D
2  AZ, AZ        F
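
To try this end to end, here is a minimal sketch that rebuilds the sample data from the question. The way `df` is constructed is an assumption (the question does not say how the data is loaded), and the names `swapped` and `out` are only illustrative. The index is named explicitly here rather than relying on a rename afterwards.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction is an assumption)
df = pd.DataFrame({
    "Name": list("AABBBCCDEEEFFF"),
    "System": ["AZ", np.nan, "AZ", np.nan, np.nan, "AY", "AY",
               "AZ", "AY", "AY", np.nan, "AZ", "AZ", np.nan],
})

# Step 1: one joined string of systems per name, ignoring missing values
s = df.dropna(subset=["System"]).groupby("Name")["System"].agg(", ".join)

# Step 2: swap roles, so the joined string is the index and the name is the value
swapped = pd.Series(
    s.index.to_numpy(),
    index=pd.Index(s.to_numpy(), name="Cluster"),
    name="Name",
)

# Step 3: join the names that share the same cluster key
out = swapped.groupby(level=0).agg(", ".join).reset_index()
print(out)
```

The cluster key is just the joined string, so two names land in the same cluster only when their strings are identical.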

4 Comments

These work perfectly fine. Is there any ML-based clustering way (e.g. KNN) I can do this?
Can you please share the code for the same?
Sure, please do share.

If the System values already appear in the same order within each group, use groupby twice, first by the Name column and then by the System column:

df1 = (df.dropna(subset=['System'])
         .groupby('Name')['System']
         .agg(','.join)
         .reset_index()
         .groupby('System')['Name']
         .agg(','.join)
         .rename_axis('Cluster')
         .reset_index())

print(df1)
  Cluster   Name
0   AY,AY    C,E
1      AZ  A,B,D
2   AZ,AZ      F
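
If the missing values should count toward the cluster instead of being ignored (as asked in the comment under the question), one possible approach is to replace them with a placeholder string rather than dropping them. This is only a sketch: the placeholder label `"NaN"` and the way `df` is built are assumptions, not part of the original data handling.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction is an assumption)
df = pd.DataFrame({
    "Name": list("AABBBCCDEEEFFF"),
    "System": ["AZ", np.nan, "AZ", np.nan, np.nan, "AY", "AY",
               "AZ", "AY", "AY", np.nan, "AZ", "AZ", np.nan],
})

# Replace missing values with a placeholder string so they take part in the join
df_nan = (df.assign(System=df["System"].fillna("NaN"))
            .groupby("Name")["System"]
            .agg(",".join)
            .reset_index()
            .groupby("System")["Name"]
            .agg(",".join)
            .rename_axis("Cluster")
            .reset_index())
print(df_nan)
```

With the sample data every name then ends up in its own cluster, because no two names share the same combination of systems and missing values.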

If the order of the System values can differ between names, sort the values first:

df1 = (df.dropna(subset=['System'])
         .sort_values(['Name','System'])
         .groupby('Name')['System'].agg(','.join)
         .reset_index()
         .groupby('System')['Name']
         .agg(','.join)
         .rename_axis('Cluster')
         .reset_index())
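
To see why the sort can matter, here is a small sketch with hypothetical data (the names X and Y and the helper `cluster` are made up for illustration). Both names have the same two systems, but in a different row order, so the joined strings differ unless the rows are sorted first.

```python
import pandas as pd

# Hypothetical data: X and Y have the same systems in a different row order
df_demo = pd.DataFrame({"Name": ["X", "X", "Y", "Y"],
                        "System": ["AZ", "AY", "AY", "AZ"]})

def cluster(frame):
    # Join systems per name, then join names per joined-system string
    return (frame.groupby("Name")["System"].agg(",".join)
                 .reset_index()
                 .groupby("System")["Name"].agg(",".join)
                 .rename_axis("Cluster")
                 .reset_index())

unsorted_out = cluster(df_demo)                                  # two clusters
sorted_out = cluster(df_demo.sort_values(["Name", "System"]))    # one cluster
print(unsorted_out)
print(sorted_out)
```

Without sorting, "AZ,AY" and "AY,AZ" are treated as different clusters; after sorting, both names share the key "AY,AZ".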

3 Comments

These work perfectly fine. Is there any ML-based clustering way (e.g. KNN) I can do this?
@SHLOKDOSHI - I think it is necessary to post a new question for that.
