
I am working on a clustering algorithm, in which I have a dataset with m rows and n features. I create a Jaccard similarity matrix for the dataset, which transforms my dataset into an m × m similarity matrix.

After creating the similarity matrix, I run some logic on it to find a few coordinates.

The logic I wrote only traverses half of the elements in the matrix, but it still takes a very long time. As I am new to Python, my code is not very optimized, just straightforward.

Please find my code below:

    similarity_dict = {}
    for (i, j), value in np.ndenumerate(matrix_for_cluster):
        if value > threshold and j >= i:
            if i in similarity_dict:
                similarity_dict[i].append(j)
                if i != j:
                    if j in similarity_dict:
                        similarity_dict[j].append(i)
                    else:
                        similarity_dict[j] = [i]
            else:
                similarity_dict[i] = [j]


matrix_for_cluster is the similarity matrix. If an element's value is greater than the threshold, the element's index is stored in a dictionary.

I would really appreciate any help with optimizing this code.

  • What about using a simple y, x = np.where(matrix_for_cluster > threshold)? That would give you the y and x coordinate vectors for where the condition is satisfied. Is this what you want? Commented May 11, 2015 at 7:47
  • @imaluengo, thanks for the reply. But what I am looking for is, for a particular row — for example row 0 — to fetch all the column coordinates like (1 (0,1), 5 (0,5), 7 (0,7)) that exceed the threshold value and store them in a dictionary under the key 0. Similarly, I would like to find all the column coordinates of each row and store those coordinates under the row number as the key. Commented May 11, 2015 at 8:02
  • @imaluengo, okay, gotcha. Would it be optimized enough if I do something like np.where(matrix_for_cluster[i] > 0.4), where i loops through each row? Commented May 11, 2015 at 8:06
  • @Sam: can you provide an example input and expected output? I don't understand what you're doing with the similarity_dict. First you declare it empty, then you try to iterate through it, using the same index name i as in your first loop (is this intentional?), and then you try to assign that key you're looping on some value of j. But the dict is empty. Then you ask similarity_dict[j] for a key which obviously isn't there? You haven't shared the full similarity_dict story here, I think. Commented May 11, 2015 at 8:08
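What the comments suggest might look like the sketch below (a rough sketch with a small made-up matrix_for_cluster and threshold; one np.where call per row replaces the per-element Python loop, and like the original code it keeps the diagonal entry when it passes the threshold):

```python
import numpy as np

# Toy 5x5 symmetric similarity matrix (stand-in for matrix_for_cluster).
matrix_for_cluster = np.array([
    [1.0, 0.6, 0.1, 0.8, 0.2],
    [0.6, 1.0, 0.7, 0.0, 0.3],
    [0.1, 0.7, 1.0, 0.5, 0.9],
    [0.8, 0.0, 0.5, 1.0, 0.1],
    [0.2, 0.3, 0.9, 0.1, 1.0],
])
threshold = 0.4

# One np.where per row replaces the element-by-element Python loop.
similarity_dict = {}
for i, row in enumerate(matrix_for_cluster):
    cols = np.where(row > threshold)[0]
    if cols.size:
        similarity_dict[i] = cols.tolist()

print(similarity_dict)
```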

2 Answers


It looks to me like what you are trying to build is a graph. In that case, you can use the networkx package:

>>> sim_matrix
array([[0, 1, 0, 2, 2],
       [1, 0, 2, 0, 1],
       [0, 2, 0, 1, 2],
       [2, 0, 1, 0, 0],
       [2, 1, 2, 0, 0]])
>>> sim_matrix[sim_matrix < 2] = 0 # apply your threshold
>>> sim_matrix
array([[0, 0, 0, 2, 2],
       [0, 0, 2, 0, 0],
       [0, 2, 0, 0, 2],
       [2, 0, 0, 0, 0],
       [2, 0, 2, 0, 0]])

With sim_matrix being a numpy array:

>>> import networkx as nx
>>> graph = nx.Graph(sim_matrix)
>>> graph.nodes()
[0, 1, 2, 3, 4]
>>> graph.edges(2)
[(2, 1), (2, 4)]
>>> graph.edges(4)
[(4, 0), (4, 2)]

Internally, networkx works with Python dictionaries, so it's pretty much what you are trying to build, but already built for you.
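To make that concrete, here is a small sketch of pulling the dict-of-lists structure straight out of networkx (using nx.from_numpy_array, available in networkx 2.0 and later; nx.Graph(sim_matrix) works too on older versions):

```python
import numpy as np
import networkx as nx

# Thresholded symmetric similarity matrix from the example above.
sim_matrix = np.array([[0, 0, 0, 2, 2],
                       [0, 0, 2, 0, 0],
                       [0, 2, 0, 0, 2],
                       [2, 0, 0, 0, 0],
                       [2, 0, 2, 0, 0]])

# Build the graph from the matrix, then export its adjacency as
# a plain {node: [neighbors]} dictionary.
graph = nx.from_numpy_array(sim_matrix)
adjacency = nx.to_dict_of_lists(graph)
print(adjacency)
```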

NOTE: this creates an undirected graph. Replace nx.Graph with nx.DiGraph if you want it directed.

EDIT: updated the example to make sim_matrix an actual symmetric matrix (undirected graph).

You can find more information about using networkx with numpy here.

Hope it helps!



This should do the same thing but probably execute fewer bytecode operations:

for (i, j), value in np.ndenumerate(matrix_for_cluster):
    if value > threshold and j >= i:
        similarity_dict.setdefault(i, []).append(j)
        if i != j:
            similarity_dict.setdefault(j, []).append(i)

But in general, scipy and numpy (which I see you're already using) have much more optimized routines for similarity matrices and the like; if you can keep all the work in numpy/scipy's native operations, you'll get much better performance.
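For instance, the whole dictionary can be built in one vectorized pass (a sketch on a toy symmetric matrix; since the original loop mirrors every above-threshold pair into both keys, scanning the full symmetric matrix with np.where yields the same dictionary):

```python
import numpy as np

# Toy symmetric matrix standing in for matrix_for_cluster.
matrix_for_cluster = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.7],
    [0.1, 0.7, 1.0],
])
threshold = 0.4

# np.where returns row indices in sorted order, so the matching column
# indices can be grouped per row with np.unique + np.split.
rows, cols = np.where(matrix_for_cluster > threshold)
unique_rows, starts = np.unique(rows, return_index=True)
similarity_dict = {int(r): g.tolist()
                   for r, g in zip(unique_rows, np.split(cols, starts[1:]))}

print(similarity_dict)  # {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
```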
