I have written a Python program that needs to deal with quite large data sets for a machine learning task. I have a train set (about 6 million rows) and a test set (about 2 million rows). So far my program runs in a reasonable amount of time, until I get to the last part of my code. The thing is, my machine learning algorithm makes predictions, and I save those predictions into a list. But before I write my predictions to a file I need to do one thing: there are duplicates between my train and test set, and I need to find those duplicates in the train set and extract their corresponding labels. To achieve this I created a dictionary with my training examples as keys and my labels as values. Afterwards, I create a new list and iterate over my test set: if an example in my test set can be found in my train set, I append the corresponding label to my new list; otherwise, I append my prediction to my new list.
The actual code I used to achieve what I described above:
from itertools import izip

listed_predictions = list(predictions)

# create a dictionary mapping training examples to their labels
train_dict = dict(izip(train, labels))

result = []
for sample in xrange(len(listed_predictions)):
    if test[sample] in train_dict.keys():
        result.append(train_dict[test[sample]])
    else:
        result.append(predictions[sample])
This loop takes roughly 2 million iterations. I thought about NumPy arrays, since those should scale better than Python lists, but I have no idea how I could achieve the same with NumPy arrays. I also thought about other optimization solutions like Cython, but before I dive into that, I am hoping that there is low-hanging fruit that I, as an inexperienced programmer with no formal computing education, don't see.
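To make the question concrete, here is a toy reproduction of my setup, plus a sketch of the kind of NumPy version I imagine is possible (Python 3 syntax, `np.isin` needs NumPy 1.13+; the data and variable names are made up, and I haven't verified how this scales):

```python
import numpy as np

# Toy data standing in for the real train/test sets
train = np.array([10, 20, 30])
labels = np.array([0, 1, 0])
test = np.array([20, 40, 10, 50])
predictions = np.array([9, 9, 9, 9])

train_dict = dict(zip(train.tolist(), labels.tolist()))

# Boolean mask of test samples that also appear in the train set,
# then overwrite the predictions at those positions with the stored labels
mask = np.isin(test, train)
result = predictions.copy()
result[mask] = [train_dict[t] for t in test[mask]]
# result -> [1, 9, 0, 9]
```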
Update: I have implemented thefourtheye's solution, and it brought my runtime down to about 10 hours, which is fast enough for what I want to achieve. Everybody, thank you for your help and suggestions.
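I won't reproduce the accepted answer verbatim, but the gist, as I understand it, is that `x in train_dict` is an O(1) hash lookup, whereas `x in train_dict.keys()` builds and linearly scans a list of keys on every iteration in Python 2. With `dict.get` the whole loop collapses to one line (toy data again, Python 3 syntax):

```python
# Toy data standing in for the real sets
train_dict = {10: 0, 20: 1, 30: 0}
test = [20, 40, 10, 50]
predictions = [9, 9, 9, 9]

# dict.get(key, default) returns the train label when the test sample
# is a duplicate of a training example, and the prediction otherwise
result = [train_dict.get(t, p) for t, p in zip(test, predictions)]
# result -> [1, 9, 0, 9]
```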