Numpy implementation of simple clustering algorithm

Question

I would like to implement a simple clustering algorithm using python. First I will describe the problem:

I have some points each point is represented by an id, and there a pair probability between each pair i.e prob(id1, id2)=some_value. This is arranged in numpy array of shape [N,3], where N is the number of all possible point pairs. To make more clear here is an example array:

a = np.array([[1,2, 0.9],
     [2,3, 0.63],
     [3,4, 0.98],
     [4,5, 0.1],
     [5,6, 0.98],
     [6,7, 1]])

where the first two entries are the point ids and the third entry is the probability that they belong to each other.

The clustering problem is connect points that pass probability cut cut=0.5 i.e. points 1,2,3,4 belong to the same cluster and 5,6,7 belong to another cluster. The current solution that I have is make a list of lists(of point ids) i.e l=[[1,2,3,4],[5,6,7]] by looping twice over the unique point ids and array a. Is there a smarter and faster way to do this?

I don't see how you'd get that result [1,2,3,4], [5,6,7] from the example data you've posted. — AKX
– AKX, Commented Mar 4, 2020 at 15:20
5, 6 belong in the same cluster, making this a single cluster. — Mad Physicist
– Mad Physicist, Commented Mar 4, 2020 at 15:25
@AKX that's because [1,2] pass the cut >0.5, also [2,3], [3,4] i.e 1,2,3 and 4 are belonging to same cluster, however [4,5] does not pass the cut, similarly for [5,6,7] — Waleed Esmail
– Waleed Esmail, Commented Mar 4, 2020 at 17:34

Mad Physicist · Accepted Answer · 2020-03-04 19:39:49Z

2

The problem you describe is a graph problem. Many common graph algorithms are implemented in the networkx package.

import numpy as np
import networkx as nx
threshold = 0.5

If your threshold is written in stone, you can pre-apply it, and build your Graph from the remaining data:

G = nx.Graph()
G.add_weighted_edges_from(a[a[:, -1] >= threshold])

If you want to play with your threshold in the graph, you can build with the whole array and remove edges in the graph. This will be slower than the pre-processed version:

G = nx.Graph()
G.add_weighted_edges_from(a)
G.remove_edges_from(e for e in g.edges if g.get_edge_data(*e) < threshold)

or alternatively

G.remove_edges_from(a[a[:, -1] < threshold]])

or

G.remove_edges_from(r for r in a if r[-1] < threshold)

Whether you construct a reduced graph or reduce it by removing edges, you can get the number and contents of clusters by looking at the components:

>>> nx.number_connected_components(G)
2
>>> for c in nx.connected_components(G):
...     print(c)
{1.0, 2.0, 3.0, 4.0}
{5.0, 6.0, 7.0}

edited Mar 4, 2020 at 19:39

answered Mar 4, 2020 at 15:55

Mad Physicist

116k29 gold badges202 silver badges292 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Waleed Esmail Over a year ago

many thanks, that's works perfectly fine, just another question are there similar tools for C++?

Collectives™ on Stack Overflow

Numpy implementation of simple clustering algorithm

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related