I am having a performance issue in Python. The snippet below has 4 nested loops: the outer two iterate over an OrderedDict, matrix_col, which has 11000 items in it, and the inner two iterate over a defaultdict, trans, which also has ~11000 items in it. Execution of this process is taking far too long. I would appreciate any advice on how to improve the performance.
import string
from collections import namedtuple
from collections import defaultdict
from collections import OrderedDict
import time
import numpy as np

trans = defaultdict(dict)
...
matrix_col = OrderedDict(sorted(matrix_col.items(), key=lambda t: t[0]))
trans_mat = []
counter = 0
for u1, v1 in matrix_col.items():
    print counter, time.ctime()
    for u2, v2 in matrix_col.items():
        flag = True
        for w1 in trans.keys():
            for w2, c in trans[u1].items():
                if u1 == str(w1) and u2 == str(w2):
                    trans_mat.append([c])
                    flag = False
        if flag:
            trans_mat.append([0])
    counter += 1
trans_mat = np.asarray(trans_mat)
trans_mat = np.reshape(trans_mat, (11000, 11000))
Here is its current performance: it is processing roughly 2 items per minute, and at this rate it will take over 5 days to form the matrix trans_mat:
0 Tue Oct 6 11:31:18 2015
1 Tue Oct 6 11:31:46 2015
2 Tue Oct 6 11:32:19 2015
3 Tue Oct 6 11:32:52 2015
4 Tue Oct 6 11:33:19 2015
5 Tue Oct 6 11:33:46 2015
Why are you iterating over all keys in trans and all items in trans[u1], when you could just test for u1 and u2 being keys in the trans and trans[u1] dictionaries? Also consider numpy and/or pandas (pandas is probably better for you if you're using OrderedDict to hold a bunch of rows with various fields in some order).
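That suggestion can be sketched as follows. This is a minimal sketch, not a drop-in replacement: it assumes the keys of trans (and of its inner dicts) stringify to the same keys used in matrix_col, as the original str(w1)/str(w2) comparisons imply, and build_trans_mat is a made-up name for illustration.

```python
import numpy as np
from collections import defaultdict

def build_trans_mat(matrix_col, trans):
    """Build the matrix with direct dict lookups instead of scanning
    every key of trans for every (u1, u2) pair."""
    keys = sorted(matrix_col)  # same row/column order as the sorted OrderedDict
    index = {k: i for i, k in enumerate(keys)}  # key -> row/column position
    trans_mat = np.zeros((len(keys), len(keys)))
    # Walk only the entries that actually exist in trans: roughly
    # O(n + number of nonzero counts) instead of the quadruple loop.
    for u1, row in trans.items():
        i = index.get(str(u1))
        if i is None:
            continue
        for u2, c in row.items():
            j = index.get(str(u2))
            if j is not None:
                trans_mat[i, j] = c
    return trans_mat
```

Note one semantic difference: the original appends one cell per (u1, u2) pair and then reshapes, so it silently assumes at most one count per pair; the dict-of-dicts structure of trans guarantees that anyway, and here unmatched cells simply stay 0 from np.zeros.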