I would like to implement a simple clustering algorithm in Python. First I will describe the problem:
I have some points, each represented by an id, and there is a pair probability for each pair of points, i.e. prob(id1, id2) = some_value. This is arranged in a numpy array of shape [N,3], where N is the number of all possible point pairs. To make this clearer, here is an example array:
import numpy as np

a = np.array([[1, 2, 0.9],
              [2, 3, 0.63],
              [3, 4, 0.98],
              [4, 5, 0.1],
              [5, 6, 0.98],
              [6, 7, 1]])
where the first two entries of each row are the point ids and the third entry is the probability that the two points belong together.
The clustering problem is to connect points whose pair probability passes a cut, cut=0.5. In the example, points 1, 2, 3, 4 belong to one cluster and 5, 6, 7 belong to another. My current solution builds a list of lists of point ids, i.e. l=[[1,2,3,4],[5,6,7]], by looping twice over the unique point ids and the array a. Is there a smarter and faster way to do this?
[1,2,3,4], [5,6,7] from the example data you've posted.
[1,2] passes the cut > 0.5, and so do [2,3] and [3,4], i.e. 1, 2, 3 and 4 belong to the same cluster. However, [4,5] does not pass the cut. Similarly for [5,6,7].
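One way to read the problem above is as finding connected components: every pair that passes the cut is an edge, and each cluster is a set of ids linked by such edges. Below is a minimal sketch of that idea using a union-find (disjoint-set) structure. The function name `cluster_pairs` and the strict `> cut` comparison are assumptions made for illustration; the rows can come from a plain list or from the numpy array a.

```python
def cluster_pairs(pairs, cut=0.5):
    """Group point ids into clusters linked by pairs with probability > cut.

    `pairs` is any iterable of (id1, id2, prob) rows, e.g. the array `a`.
    The name and the strict `>` comparison are illustrative assumptions.
    """
    parent = {}  # maps each point id to its parent in the union-find forest

    def find(x):
        # Walk up to the root of x's tree.
        root = x
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the walk directly at the root.
        while parent[x] != root:
            parent[x], x = root, parent[x]
        return root

    for id1, id2, prob in pairs:
        id1, id2 = int(id1), int(id2)  # numpy rows store the ids as floats
        parent.setdefault(id1, id1)
        parent.setdefault(id2, id2)
        if prob > cut:
            # The pair passes the cut, so merge the two clusters.
            parent[find(id1)] = find(id2)

    # Collect the members of each cluster under their shared root.
    clusters = {}
    for point in parent:
        clusters.setdefault(find(point), []).append(point)
    return sorted(sorted(members) for members in clusters.values())


pairs = [[1, 2, 0.9], [2, 3, 0.63], [3, 4, 0.98],
         [4, 5, 0.1], [5, 6, 0.98], [6, 7, 1]]
print(cluster_pairs(pairs))  # [[1, 2, 3, 4], [5, 6, 7]]
```

This makes a single pass over the pairs instead of a double loop over ids and rows. A point that appears only in pairs failing the cut ends up as its own single-element cluster, which may or may not be what is wanted.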