1

I have a program that processes a csv file. The contents of the CSV is as follows

lines = [
 [id_A, val1, val2, ..., valn],
 [id_A, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_C, val1, val2, ..., valn],
 [id_C, val1, val2, ..., valn],
 ]

I am building a dictionary that looks like

my_dict = {
 'id_A': ['many', 'values'],
 'id_B': ['many', ''more', 'values']
 'id_C': ['some', 'other', 'values']}

My current implementation looks like

for line in lines:
    log_id = line[0]
        if log_id not in my_dict.keys():
            datablock = lines[1:]
            my_dict[log_id] = datablock
        else:
            my_dict[log_id].append(lines[1:])

With close to a million lines in the csv, the program starts to slow down very significantly once there are a couple thousand entries in the dictionary. I have been debugging it with a spattering of print statements, and the bottleneck seems to be here in the if log_id not in my_dict.keys(): line

I tried using a seperate list for keeping track of the ids already in the dictionary, but that did not seem to help.

Could using set here work, or is that option out since it changes each loop and would need to be reconstructed?

1
  • Did you mean to append additional lists, or extend? Commented Jun 19, 2014 at 17:33

1 Answer 1

12

You are creating a list of all keys each time. Remove the dict.keys() call, it is slowing you down but is not needed:

if log_id not in my_dict:

Dictionaries support membership testing directly, and does so in O(1) time. dict.keys() returns a new list, however, and membership testing on a list is not efficient (it takes O(N) time). So for each membership test, your code would loop over all keys to produce a new list object, then loop over that list again to find a match.

You can simplify your code a bit by using dict.setdefault():

for line in lines:
    log_id = line[0]
    my_dict.setdefault(log_id, []).append(lines[1:])

dict.setdefault() returns the value associated with the given key, and if the key is missing, uses the second argument as the default value (adding the key and value to the dictionary).

Alternatively, use a collections.defaultdict() object instead of your plain dictionary:

from collections import defaultdict

mydict = defaultdict(list)

for line in lines:
    log_id = line[0]
    my_dict[log_id].append(lines[1:])

A defaultdict is a simple dict subclass that will call the configured factory every time a key is missing; here list() is called to create a new value for missing keys the moment you try to access one.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.