I have written a Python program that needs to deal with quite large data sets for a machine learning task. I have a train set (about 6 million rows) and a test set (about 2 million rows). So far my program runs in a reasonable amount of time, until I get to the last part of my code. The thing is, my machine learning algorithm makes predictions, and I save those predictions into a list. But before I write my predictions to a file I need to do one thing: there are duplicates between my train and test set, and I need to find those duplicates in the train set and extract their corresponding labels. To achieve this I created a dictionary with my training examples as keys and my labels as values. Afterwards, I create a new list and iterate over my test set: if an example in my test set can be found in my train set, I append the corresponding label to my new list; otherwise, I append my prediction to my new list.
The actual code I used to achieve what I described above:
from itertools import izip

listed_predictions = list(predictions)

# create a dictionary mapping training examples to their labels
train_dict = dict(izip(train, labels))

result = []
for sample in xrange(len(listed_predictions)):
    if test[sample] in train_dict.keys():
        result.append(train_dict[test[sample]])
    else:
        result.append(predictions[sample])
This loop takes roughly 2 million iterations. I thought about NumPy arrays, since those should scale better than Python lists, but I have no idea how I could achieve the same with NumPy arrays. I also thought about other optimization solutions like Cython, but before I dive into that, I am hoping that there is low-hanging fruit that I, as an inexperienced programmer with no formal computing education, don't see.
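To make the question concrete, here is a toy reproduction of my setup, plus a sketch of the kind of NumPy version I imagine is possible (Python 3 syntax, `np.isin` needs NumPy 1.13+; the data and variable names are made up, and I haven't verified how this scales):

```python
import numpy as np

# Toy data standing in for the real train/test sets
train = np.array([10, 20, 30])
labels = np.array([0, 1, 0])
test = np.array([20, 40, 10, 50])
predictions = np.array([9, 9, 9, 9])

train_dict = dict(zip(train.tolist(), labels.tolist()))

# Boolean mask of test samples that also appear in the train set,
# then overwrite the predictions at those positions with the stored labels
mask = np.isin(test, train)
result = predictions.copy()
result[mask] = [train_dict[t] for t in test[mask]]
# result -> [1, 9, 0, 9]
```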
Update: I have implemented thefourtheye's solution, and it brought my runtime down to about 10 hours, which is fast enough for what I want to achieve. Everybody, thank you for your help and suggestions.
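I won't reproduce the accepted answer verbatim, but the gist, as I understand it, is that `x in train_dict` is an O(1) hash lookup, whereas `x in train_dict.keys()` builds and linearly scans a list of keys on every iteration in Python 2. With `dict.get` the whole loop collapses to one line (toy data again, Python 3 syntax):

```python
# Toy data standing in for the real sets
train_dict = {10: 0, 20: 1, 30: 0}
test = [20, 40, 10, 50]
predictions = [9, 9, 9, 9]

# dict.get(key, default) returns the train label when the test sample
# is a duplicate of a training example, and the prediction otherwise
result = [train_dict.get(t, p) for t, p in zip(test, predictions)]
# result -> [1, 9, 0, 9]
```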