I am implementing a clustering algorithm on a dataset with m rows and n features. I compute a Jaccard similarity matrix for the dataset, which turns it into an m × m similarity matrix.
After creating the similarity matrix, I run some logic on it to find certain coordinates.
The logic I wrote only works on half of the elements in the matrix, but it takes a very long time. As I am new to Python, my code is straightforward rather than optimized.
Please find my code below:
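    import numpy as np

    similarity_dict = {}
    # Visit every cell; only the upper triangle (j >= i) above the threshold is used
    for (i, j), value in np.ndenumerate(matrix_for_cluster):
        if value > threshold and j >= i:
            if i in similarity_dict:
                similarity_dict[i].append(j)
                # Record the symmetric entry too, except on the diagonal
                if i != j:
                    if j in similarity_dict:
                        similarity_dict[j].append(i)
                    else:
                        similarity_dict[j] = [i]
            else:
                similarity_dict[i] = [j]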
matrix_for_cluster is the similarity matrix. If an element's value is greater than the threshold, that element's index is stored in the dictionary.
I would really appreciate any help with optimizing this code.
y, x = np.where(matrix_for_cluster > threshold)? That would give you the y and x coordinate vectors for where the condition is satisfied. Is this what you want?

similarity_dict. First you declare it empty, then you try to iterate through it, using the same index name you did in your first loop, i (is this intentional?), and then you try to assign to that key you're looping on some value of j. But the dict is empty. Then you ask similarity_dict[j] for a key which obviously isn't there? You haven't shared the full similarity_dict story here, I think.
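
To make the np.where suggestion concrete, here is a minimal sketch of how the dictionary might be built from only the cells that pass the threshold, instead of visiting all m × m cells in Python. It assumes the similarity matrix is symmetric (which a Jaccard similarity matrix normally is), so scanning the full matrix yields the same neighbor lists as mirroring the upper triangle. The function name neighbors_above_threshold and the sample matrix are made up for illustration; np.nonzero on a boolean mask is the same call as the one-argument np.where.

```python
import numpy as np
from collections import defaultdict


def neighbors_above_threshold(sim, threshold):
    """Map each row index to the column indices whose similarity exceeds threshold.

    Assumes `sim` is a square, symmetric 2-D array.
    """
    # Row and column indices of every cell above the threshold, in row-major order.
    rows, cols = np.nonzero(sim > threshold)

    similarity_dict = defaultdict(list)
    # The Python-level loop now runs once per hit, not once per matrix cell.
    for i, j in zip(rows.tolist(), cols.tolist()):
        similarity_dict[i].append(j)
    return dict(similarity_dict)


# Hypothetical 3 x 3 similarity matrix, only for demonstration.
sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.6],
    [0.1, 0.6, 1.0],
])
result = neighbors_above_threshold(sim, 0.5)
# result == {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
```

The comparison sim > threshold and the index lookup run inside NumPy, so the remaining loop cost scales with the number of matches rather than with m². If the matrix were not symmetric, the lists would differ from what the upper-triangle loop in the question produces.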