I have a program that processes a CSV file. The contents of the CSV are as follows:
lines = [
    [id_A, val1, val2, ..., valn],
    [id_A, val1, val2, ..., valn],
    [id_B, val1, val2, ..., valn],
    [id_B, val1, val2, ..., valn],
    [id_B, val1, val2, ..., valn],
    [id_B, val1, val2, ..., valn],
    [id_C, val1, val2, ..., valn],
    [id_C, val1, val2, ..., valn],
]
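For context, I read the rows in more or less like this (simplified; the filename is just a placeholder and the real code does a bit more):

import csv

# Each row comes back as a list of strings: [id, val1, val2, ..., valn]
with open('data.csv', newline='') as f:
    lines = [row for row in csv.reader(f)]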
I am building a dictionary that looks like this:
my_dict = {
    'id_A': ['many', 'values'],
    'id_B': ['many', 'more', 'values'],
    'id_C': ['some', 'other', 'values'],
}
My current implementation looks like this:
for line in lines:
    log_id = line[0]
    if log_id not in my_dict.keys():
        # first time we see this id: start a new entry with the row's values
        datablock = line[1:]
        my_dict[log_id] = datablock
    else:
        # id already seen: append this row's values to the existing entry
        my_dict[log_id].append(line[1:])
With close to a million lines in the CSV, the program slows down very significantly once there are a couple of thousand entries in the dictionary. I have been debugging it with a smattering of print statements, and the bottleneck seems to be the membership check, i.e. the if log_id not in my_dict.keys(): line.
I tried using a separate list to keep track of the ids already in the dictionary, but that did not seem to help.
Could using a set here work, or is that option out since the set changes on each loop iteration and would need to be reconstructed?
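To make the question concrete, this is roughly what I have in mind (an untested sketch; seen_ids is just a placeholder name):

seen_ids = set()
my_dict = {}
for line in lines:
    log_id = line[0]
    if log_id not in seen_ids:
        # set membership test instead of checking my_dict.keys()
        seen_ids.add(log_id)
        my_dict[log_id] = line[1:]
    else:
        my_dict[log_id].append(line[1:])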