I am having a performance issue in Python. The snippet below has 4 nested loops: the outer two iterate over an OrderedDict, matrix_col, which has 11000 items in it, and the inner two iterate over a defaultdict, trans, which also has ~11000 items in it. Execution of this process is taking far too long. I would appreciate any advice on how to improve the performance.
import string
from collections import namedtuple
from collections import defaultdict
from collections import OrderedDict
import time
import numpy as np

trans = defaultdict(dict)
...
matrix_col = OrderedDict(sorted(matrix_col.items(), key=lambda t: t[0]))
trans_mat = []
counter = 0
for u1, v1 in matrix_col.items():
    print counter, time.ctime()
    for u2, v2 in matrix_col.items():
        flag = True
        for w1 in trans.keys():
            for w2, c in trans[u1].items():
                if u1 == str(w1) and u2 == str(w2):
                    trans_mat.append([c])
                    flag = False
        if flag:
            trans_mat.append([0])
    counter += 1
trans_mat = np.asarray(trans_mat)
trans_mat = np.reshape(trans_mat, (11000, 11000))
Here is its current performance: it is processing roughly 2 items per minute, and at this rate it will take over 5 days to form the matrix trans_mat:
0 Tue Oct 6 11:31:18 2015
1 Tue Oct 6 11:31:46 2015
2 Tue Oct 6 11:32:19 2015
3 Tue Oct 6 11:32:52 2015
4 Tue Oct 6 11:33:19 2015
5 Tue Oct 6 11:33:46 2015
Why are you iterating over all keys in trans and all items in trans[u1], when you could just test for u1 and u2 being keys in the trans and trans[u1] dictionaries? Also consider numpy and/or pandas (pandas is probably better for you if you're using OrderedDict to hold a bunch of rows with various fields in some order).
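That suggestion can be sketched as follows. This is a minimal sketch, not a drop-in replacement: it assumes the keys of trans (and of its inner dicts) stringify to the same keys used in matrix_col, as the original str(w1)/str(w2) comparisons imply, and build_trans_mat is a made-up name for illustration.

```python
import numpy as np
from collections import defaultdict

def build_trans_mat(matrix_col, trans):
    """Build the matrix with direct dict lookups instead of scanning
    every key of trans for every (u1, u2) pair."""
    keys = sorted(matrix_col)  # same row/column order as the sorted OrderedDict
    index = {k: i for i, k in enumerate(keys)}  # key -> row/column position
    trans_mat = np.zeros((len(keys), len(keys)))
    # Walk only the entries that actually exist in trans: roughly
    # O(n + number of nonzero counts) instead of the quadruple loop.
    for u1, row in trans.items():
        i = index.get(str(u1))
        if i is None:
            continue
        for u2, c in row.items():
            j = index.get(str(u2))
            if j is not None:
                trans_mat[i, j] = c
    return trans_mat
```

Note one semantic difference: the original appends one cell per (u1, u2) pair and then reshapes, so it silently assumes at most one count per pair; the dict-of-dicts structure of trans guarantees that anyway, and here unmatched cells simply stay 0 from np.zeros.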