Clustering in DataFrame using Python

Question

I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

I need to cluster the names based on the number of times each "System" value is repeated for a particular "Name".

In the above example, names A, B and D each have one "AZ", C and E each have two "AY", and F has two "AZ", so F forms a separate cluster. NaN can be ignored.

Output Example:

Cluster     Names
AZ          A,B,D
AY,AY       C,E
AZ,AZ       F

How can I do it using Python?

PS. The actual dataset may vary in number of rows and columns. Also, how can I do it using ML-based classification algorithms like KNN, Naive Bayes, etc.?

In the above question, how can I form clusters without ignoring NaN values?
– spd, Commented Feb 2, 2022 at 17:22
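
One possible reading of this follow-up, offered only as a sketch: treat a missing "System" as a label of its own before grouping, so that it becomes part of each name's signature. The placeholder string "NaN" and the way `df` is built from the sample table are assumptions, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction of df is an assumption)
df = pd.DataFrame({
    "Name":   list("AABBBCCDEEEFFF"),
    "System": ["AZ", np.nan, "AZ", np.nan, np.nan, "AY", "AY",
               "AZ", "AY", "AY", np.nan, "AZ", "AZ", np.nan],
})

# Assumption: a missing System counts as its own label, the string "NaN"
filled = df.assign(System=df["System"].fillna("NaN"))

# Join the labels per Name, then join the Names that share the same string
out = (filled.groupby("Name")["System"].agg(",".join)
             .reset_index()
             .groupby("System")["Name"].agg(",".join)
             .rename_axis("Cluster")
             .reset_index())
print(out)
```

With the sample data every name ends up with a different signature once NaN is counted, so each name lands in its own cluster.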

user7864386 · Accepted Answer · 2022-02-02 07:23:10Z

4

Use groupby + agg twice: once to join the "System" values of each name, and then to join the "Name" values that share the same joined string:

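# join the non-null System values of each Name into one string
s = df.dropna().groupby('Name').agg(', '.join)['System']
# swap index and values so the joined string becomes the grouping key
s = pd.Series(s.index, index=s)
# join the Names that share the same joined string
out = s.groupby(level=0).agg(', '.join).reset_index().rename(columns={'System':'Cluster'})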

Output:

  Cluster     Name
0  AY, AY     C, E
1      AZ  A, B, D
2  AZ, AZ        F
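
To try this end to end, here is a minimal sketch that rebuilds the sample table and names the intermediate result, so the two grouping steps are easier to follow. How `df` is constructed and the variable name `per_name` are assumptions for illustration. The second step is written with `reset_index` rather than the index swap above, which should be equivalent.

```python
import numpy as np
import pandas as pd

# Sample data from the question (construction of df is an assumption)
df = pd.DataFrame({
    "Name":   list("AABBBCCDEEEFFF"),
    "System": ["AZ", np.nan, "AZ", np.nan, np.nan, "AY", "AY",
               "AZ", "AY", "AY", np.nan, "AZ", "AZ", np.nan],
})

# Step 1: one joined string of non-null System values per Name
per_name = df.dropna(subset=["System"]).groupby("Name")["System"].agg(", ".join)
print(per_name)

# Step 2: group the Names by that string and join them
out = (per_name.reset_index()
               .groupby("System")["Name"].agg(", ".join)
               .rename_axis("Cluster")
               .reset_index())
print(out)
```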

answered Feb 2, 2022 at 7:23


4 Comments

spd Over a year ago

These work perfectly fine. Is there an ML-based clustering approach (e.g. KNN) I can use for this?

spd Over a year ago

Can you please share the code for that?

spd Over a year ago

Sure, please do share.

spd Over a year ago

stackoverflow.com/q/70966309/17778275

jezrael · Accepted Answer · 2022-02-02 07:33:27Z

4

If the order of the System values is the same within each group, use a double groupby, first by the Name column and then by the System column:

df1 = (df.dropna(subset=['System'])
         .groupby('Name')['System']
         .agg(','.join)
         .reset_index()
         .groupby('System')['Name']
         .agg(','.join)
         .rename_axis('Cluster')
         .reset_index())

print(df1)
  Cluster   Name
0   AY,AY    C,E
1      AZ  A,B,D
2   AZ,AZ      F

If the order can differ between groups, sort the values first:

df1 = (df.dropna(subset=['System'])
         .sort_values(['Name','System'])
         .groupby('Name')['System'].agg(','.join)
         .reset_index()
         .groupby('System')['Name']
         .agg(','.join)
         .rename_axis('Cluster')
         .reset_index())
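
A small sketch of why the sort can matter. The names G and H and their values are hypothetical, not from the question: both use the same systems, but the rows appear in a different order. The helper `cluster` is also just an illustrative name wrapping the chain above.

```python
import pandas as pd

# Hypothetical data: G and H have the same systems in a different row order
df = pd.DataFrame({
    "Name":   ["G", "G", "H", "H"],
    "System": ["AZ", "AY", "AY", "AZ"],
})

def cluster(frame):
    """Join System values per Name, then join Names per joined string."""
    return (frame.groupby("Name")["System"].agg(",".join)
                 .reset_index()
                 .groupby("System")["Name"].agg(",".join)
                 .rename_axis("Cluster")
                 .reset_index())

# Without sorting, "AZ,AY" and "AY,AZ" are treated as different clusters
unsorted_result = cluster(df)
# After sorting, both names share the string "AY,AZ"
sorted_result = cluster(df.sort_values(["Name", "System"]))

print(unsorted_result)
print(sorted_result)
```

Without the sort, G and H fall into two separate clusters. With it, they end up in one.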

edited Feb 2, 2022 at 7:33

answered Feb 2, 2022 at 7:23


3 Comments

spd Over a year ago

These work perfectly fine. Is there an ML-based clustering approach (e.g. KNN) I can use for this?

jezrael Over a year ago

@SHLOKDOSHI - I think it is necessary to post a new question.

spd Over a year ago

stackoverflow.com/q/70966309/17778275 Please share here.
