Python speed up: checking if value in list

Question

I have a program that processes a csv file. The contents of the CSV is as follows

lines = [
 [id_A, val1, val2, ..., valn],
 [id_A, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_B, val1, val2, ..., valn],
 [id_C, val1, val2, ..., valn],
 [id_C, val1, val2, ..., valn],
 ]

I am building a dictionary that looks like

my_dict = {
 'id_A': ['many', 'values'],
 'id_B': ['many', ''more', 'values']
 'id_C': ['some', 'other', 'values']}

My current implementation looks like

for line in lines:
    log_id = line[0]
        if log_id not in my_dict.keys():
            datablock = lines[1:]
            my_dict[log_id] = datablock
        else:
            my_dict[log_id].append(lines[1:])

With close to a million lines in the csv, the program starts to slow down very significantly once there are a couple thousand entries in the dictionary. I have been debugging it with a spattering of print statements, and the bottleneck seems to be here in the if log_id not in my_dict.keys(): line

I tried using a seperate list for keeping track of the ids already in the dictionary, but that did not seem to help.

Could using set here work, or is that option out since it changes each loop and would need to be reconstructed?

Did you mean to append additional lists, or extend?

Martijn Pieters
– Martijn Pieters

2014-06-19 17:33:02 +00:00
Commented Jun 19, 2014 at 17:33 — Martijn Pieters
– Martijn Pieters, Commented Jun 19, 2014 at 17:33

Martijn Pieters · Accepted Answer · 2014-06-19 16:02:50Z

You are creating a list of all keys each time. Remove the dict.keys() call, it is slowing you down but is not needed:

if log_id not in my_dict:

Dictionaries support membership testing directly, and does so in O(1) time. dict.keys() returns a new list, however, and membership testing on a list is not efficient (it takes O(N) time). So for each membership test, your code would loop over all keys to produce a new list object, then loop over that list again to find a match.

You can simplify your code a bit by using dict.setdefault():

for line in lines:
    log_id = line[0]
    my_dict.setdefault(log_id, []).append(lines[1:])

dict.setdefault() returns the value associated with the given key, and if the key is missing, uses the second argument as the default value (adding the key and value to the dictionary).

Alternatively, use a collections.defaultdict() object instead of your plain dictionary:

from collections import defaultdict

mydict = defaultdict(list)

for line in lines:
    log_id = line[0]
    my_dict[log_id].append(lines[1:])

A defaultdict is a simple dict subclass that will call the configured factory every time a key is missing; here list() is called to create a new value for missing keys the moment you try to access one.

Collectives™ on Stack Overflow

Python speed up: checking if value in list

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related