
The following loop creates a giant bottleneck in my program, particularly since records can contain over 500k items.

records = [item for sublist in records for item in sublist] #flatten the list
for rec in records:
    if len(rec) > 5:
        tag = '%s.%s' %(rec[4], rec[5].strip())
        if tag in mydict:
            mydict[tag][0] += 1
            mydict[tag][1].add(rec[6].strip())
        else:
            mydict[tag] = [1, set(rec[6].strip())]

I don't see a way that I could do this with a dictionary/list comprehension, and I'm not sure calling map would do me much good. Is there any way to optimize this loop?

Edit: The dictionary contains information about certain operations occurring in a program. rec[4] is the package which contains the operation and rec[5] is the name of the operation. The raw logs contain an int instead of the actual name, so when the log files are read into the list, the int is looked up and replaced with the operation name. The incremental counter counts how many times the operation was executed and the set contains the parameters for the operation. I am using a set because I don't want duplicates for the parameters. The strip is simply to remove white space. The existence of this white space is unpredictable in rec[6], but rather consistent in rec[4] and rec[5].

  • I don't see an easy way to optimize -- However, I feel obliged to point out that set(rec[6].strip()) is likely to create a set of single character strings -- which doesn't seem to jibe well with set.add(rec[6].strip()) which adds a string to the set. Commented Sep 5, 2014 at 20:32
  • Just making records a genexp would help. Commented Sep 5, 2014 at 20:32
  • flattening the nested list doubles your processing time, doesn't it? Commented Sep 5, 2014 at 20:40
  • That portion seemed to run faster than the loop. It is a bit tricky to avoid that since it is the filtered contents of multiple files and currently the function that does the filtering processes one file at a time, but I suppose the function could be changed to handle multiple ones and avoid the flattening problem. Commented Sep 5, 2014 at 20:45
  • Why are you using a string as key? Could it be a tuple? Also, what is the average length of rec[4], rec[5], rec[6], and what kind of characters are you stripping? What is the size of the dictionary at the end of the loop? Why do you use a set and increment your count even if the item is already in the set? Commented Sep 6, 2014 at 9:03

2 Answers


Instead of flattening such a huge list you can directly iterate over its flattened iterator using itertools.chain.from_iterable.

from itertools import chain

for rec in chain.from_iterable(records):
    # rest of the loop body
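
For illustration, a minimal self-contained sketch of the lazy-iteration approach, using invented sample records (the field layout is assumed from the question; the set literal also sidesteps the set(rec[6].strip()) pitfall pointed out in the comments):

```python
from itertools import chain

# hypothetical sample data: two "files" worth of records, each record a
# 7-field tuple so that rec[4], rec[5], rec[6] exist
records = [
    [(0, 0, 0, 0, "pkg.a", "open ", " x=1 "),
     (0, 0, 0, 0, "pkg.a", "read", "x=2")],
    [(0, 0, 0, 0, "pkg.b", "close", "x=1")],
]

mydict = {}
# iterate lazily over the nested lists instead of building a flat copy
for rec in chain.from_iterable(records):
    if len(rec) > 5:
        tag = '%s.%s' % (rec[4], rec[5].strip())
        if tag in mydict:
            mydict[tag][0] += 1
            mydict[tag][1].add(rec[6].strip())
        else:
            # set literal, not set(string): one element, not one per character
            mydict[tag] = [1, {rec[6].strip()}]
```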

In a quick benchmark this is around 3× faster than the equivalent nested generator expression, and faster still than the list comprehension version:

In [13]: records = [[None]*500]*10000

In [14]: %%timeit
    ...: for rec in chain.from_iterable(records): pass
    ...: 
10 loops, best of 3: 54.7 ms per loop

In [15]: %%timeit
    ...: for rec in (item for sublist in records for item in sublist): pass
    ...: 
10 loops, best of 3: 170 ms per loop

In [16]: %%timeit #Your version
    ...: for rec in [item for sublist in records for item in sublist]: pass
    ...: 
1 loops, best of 3: 249 ms per loop



I don't know if this will make it faster or not, but instead of having ...

if tag in mydict:
    mydict[tag][0] += 1
    mydict[tag][1].add(rec[6].strip())
else:
    mydict[tag] = [1, set(rec[6].strip())]

you can try ...

element = mydict.setdefault(tag, [0, set()])
element[0] += 1
element[1].add(rec[6].strip())
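
A runnable sketch of this setdefault pattern, on invented sample records (field layout assumed from the question):

```python
mydict = {}
# two hypothetical records for the same package/operation, differing parameters
recs = [
    ("a", "b", "c", "d", "pkg", "op ", " p1 "),
    ("a", "b", "c", "d", "pkg", "op", "p2"),
]
for rec in recs:
    tag = '%s.%s' % (rec[4], rec[5].strip())
    # setdefault returns the stored list, inserting [0, set()] only if absent
    element = mydict.setdefault(tag, [0, set()])
    element[0] += 1
    element[1].add(rec[6].strip())
```

One caveat: setdefault evaluates its default argument on every call, so a fresh list and set are constructed even when the key already exists; collections.defaultdict avoids that cost.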

