
I am working on a clustering algorithm, in which I have a dataset with m rows and n features. I create a Jaccard similarity matrix for the dataset, which transforms my dataset into an m × m similarity matrix.

After creating the similarity matrix, I run some logic on it to find a few coordinates.

The logic I wrote only traverses half of the elements in the matrix, but it still takes a very long time. As I am new to Python, my code is not very optimized, just straightforward.

Please find my code below:

    similarity_dict = {}
    for (i, j), value in np.ndenumerate(matrix_for_cluster):
        if value > threshold and j >= i:
            if i in similarity_dict:
                similarity_dict[i].append(j)
                if i != j:
                    if j in similarity_dict:
                        similarity_dict[j].append(i)
                    else:
                        similarity_dict[j] = [i]
            else:
                similarity_dict[i] = [j]


matrix_for_cluster is the similarity matrix. If an element's value is greater than the threshold, the element's index is stored in a dictionary.

I would really appreciate any help with optimizing this code.

  • What about using a simple y, x = np.where(matrix_for_cluster > threshold)? That would give you the y and x coordinate vectors for where the condition is satisfied. Is this what you want? Commented May 11, 2015 at 7:47
  • @imaluengo, thanks for the reply. But what I am looking for is, for a particular row — for example row 0 — to fetch all the column coordinates like (1 (0,1), 5 (0,5), 7 (0,7)) that exceed the threshold value and store them in a dictionary under the key 0. Similarly, I would like to find all the column coordinates of each row and store those coordinates under the row number as the key. Commented May 11, 2015 at 8:02
  • @imaluengo, okay, gotcha. Would it be optimized enough if I do something like np.where(matrix_for_cluster[i] > 0.4), where i loops through each row? Commented May 11, 2015 at 8:06
  • @Sam: can you provide an example input and expected output? I don't understand what you're doing with the similarity_dict. First you declare it empty, then you try to iterate through it, using the same index name i as in your first loop (is this intentional?), and then you try to assign that key you're looping on some value of j. But the dict is empty. Then you ask similarity_dict[j] for a key which obviously isn't there? You haven't shared the full similarity_dict story here, I think. Commented May 11, 2015 at 8:08
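What the comments suggest might look like the sketch below (a rough sketch with a small made-up matrix_for_cluster and threshold; one np.where call per row replaces the per-element Python loop, and like the original code it keeps the diagonal entry when it passes the threshold):

```python
import numpy as np

# Toy 5x5 symmetric similarity matrix (stand-in for matrix_for_cluster).
matrix_for_cluster = np.array([
    [1.0, 0.6, 0.1, 0.8, 0.2],
    [0.6, 1.0, 0.7, 0.0, 0.3],
    [0.1, 0.7, 1.0, 0.5, 0.9],
    [0.8, 0.0, 0.5, 1.0, 0.1],
    [0.2, 0.3, 0.9, 0.1, 1.0],
])
threshold = 0.4

# One np.where per row replaces the element-by-element Python loop.
similarity_dict = {}
for i, row in enumerate(matrix_for_cluster):
    cols = np.where(row > threshold)[0]
    if cols.size:
        similarity_dict[i] = cols.tolist()

print(similarity_dict)
```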

2 Answers


It looks to me like what you are trying to build is a graph. In that case, you can use the networkx package:

>>> sim_matrix
array([[0, 1, 0, 2, 2],
       [1, 0, 2, 0, 1],
       [0, 2, 0, 1, 2],
       [2, 0, 1, 0, 0],
       [2, 1, 2, 0, 0]])
>>> sim_matrix[sim_matrix < 2] = 0 # apply your threshold
>>> sim_matrix
array([[0, 0, 0, 2, 2],
       [0, 0, 2, 0, 0],
       [0, 2, 0, 0, 2],
       [2, 0, 0, 0, 0],
       [2, 0, 2, 0, 0]])

With sim_matrix being a numpy array:

>>> import networkx as nx
>>> graph = nx.Graph(sim_matrix)
>>> graph.nodes()
[0, 1, 2, 3, 4]
>>> graph.edges(2)
[(2, 1), (2, 4)]
>>> graph.edges(4)
[(4, 0), (4, 2)]

Internally, networkx works with Python dictionaries, so it's pretty much what you are trying to build, but already built for you.
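To make that concrete, here is a small sketch of pulling the dict-of-lists structure straight out of networkx (using nx.from_numpy_array, available in networkx 2.0 and later; nx.Graph(sim_matrix) works too on older versions):

```python
import numpy as np
import networkx as nx

# Thresholded symmetric similarity matrix from the example above.
sim_matrix = np.array([[0, 0, 0, 2, 2],
                       [0, 0, 2, 0, 0],
                       [0, 2, 0, 0, 2],
                       [2, 0, 0, 0, 0],
                       [2, 0, 2, 0, 0]])

# Build the graph from the matrix, then export its adjacency as
# a plain {node: [neighbors]} dictionary.
graph = nx.from_numpy_array(sim_matrix)
adjacency = nx.to_dict_of_lists(graph)
print(adjacency)
```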

NOTE: this creates an undirected graph. Replace nx.Graph with nx.DiGraph if you want it directed.

EDIT: updated the example to make sim_matrix an actual symmetric matrix (undirected graph).

You can find more information about using networkx with numpy here.

Hope it helps!



This should do the same thing but probably execute fewer bytecode operations:

for (i, j), value in np.ndenumerate(matrix_for_cluster):
    if value > threshold and j >= i:
        similarity_dict.setdefault(i, []).append(j)
        if i != j:
            similarity_dict.setdefault(j, []).append(i)

But in general, scipy and numpy (which I see you're already using) have much more optimized routines for similarity matrices and the like; if you can keep all the work in numpy/scipy's native operations, you'll get much better performance.
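For instance, the whole dictionary can be built in one vectorized pass (a sketch on a toy symmetric matrix; since the original loop mirrors every above-threshold pair into both keys, scanning the full symmetric matrix with np.where yields the same dictionary):

```python
import numpy as np

# Toy symmetric matrix standing in for matrix_for_cluster.
matrix_for_cluster = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.7],
    [0.1, 0.7, 1.0],
])
threshold = 0.4

# np.where returns row indices in sorted order, so the matching column
# indices can be grouped per row with np.unique + np.split.
rows, cols = np.where(matrix_for_cluster > threshold)
unique_rows, starts = np.unique(rows, return_index=True)
similarity_dict = {int(r): g.tolist()
                   for r, g in zip(unique_rows, np.split(cols, starts[1:]))}

print(similarity_dict)  # {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
```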
