2

I have written a python program that needs to deal with quite large data sets for a machine learning task. I have a train set (about 6 million rows) and a test set (about 2 million rows). So far I my program runs in a reasonable amount of time until I get to the last part of my code. The thing is I have my machine learning algorithm that makes predictions, and I save those predictions into a list. But before I write my predictions to a file I need to do one thing. There are duplicates in my train and test set. I need to find those duplicates in the train set and extract their corresponding label. To achieve this I created a dictionary with my training examples as keys and my labels as values. Afterwards, I create a new list and iterate over my test set and train set. If an example in my test set can be found in my train set append the corresponding labels to my new list, otherwise, append my predictions to my new list.

The actual code I used to achieve the matter I described above:

listed_predictions =  list(predictions)

    """"creating a dictionary"""
    train_dict = dict(izip(train,labels))


    result = []
    for sample in xrange(len(listed_predictions)):
        if test[sample] in train_dict.keys():
            result.append(train_dict[test[sample]])
        else:
            result.append(predictions[sample])

This loop takes roughly 2 million iterations. I thought about numpy arrays, since those should scale better than python lists, but I have no idea how could achieve the same with numpy arrays. Also thought about other optimization solutions like Cython, but before I dive into that, I am hoping that there are low hanging fruits that I, as an inexperienced programmer with no formal computing education, don't see.

Update I have implemented thefourtheye's solution, and it brought my runtime down to about 10 hours, which is fast enough for what I want to achieve. Everybody, thank you for your help and suggestions.

1

4 Answers 4

4

Two suggestions,

  1. To check if a key is in a dict, simply use in and the object (this happens in O(1))

    if key in dict:
    
  2. Use comprehensions whenever possible.

So, your code becomes like this

result = [train_dict.get(test[sample], predictions[sample]) for sample in xrange(len(listed_predictions))]
Sign up to request clarification or add additional context in comments.

4 Comments

Thank you for your answer. I didn't know that list comprehensions were faster than normal for loops! I cannot try this out right now, but I will try it out.
Why access test[sample]? why not result = [train_dict[sample] if sample in train_dict else[predictions[sample] for sample in test] with a test = test[:len(listed_predictions] right at the start?
@inspectorG4dget sample is just a loop index, right. So, it might not be there in train_dict
Ah! but if you look at my code, you'll see that sample is each item in test
4

test[sample] in train_dict.keys() is extremely inefficient. It iterates over all the keys of train_dict looking for the value, when the whole point of dictionaries is supposed to be fast key lookup.

Use test[sample] in train_dict instead -- that change alone might solve your performance issues.

Also, do you actually need results to be a list? It may or may not help performance if you just avoid creating that 2 million entry list. How about:

def results(sample):
    item = test[sample]
    return train_dict[item] if item in train_dict else predictions[sample]

Something to compare for performance:

def results(sample):
    # advantage - only looks up the key once
    # disadvantage - accesses `predictions` whether needed or not,
    # so could be cache inefficient
    return train_dict.get(test[sample], predictions[sample])

We can try to get both advantages:

def results(sample):
    # disadvantage - goes wrong if train_dict contains any value that's false
    return train_dict.get(test[sample]) or performance[sample]

def results(sample):
    # disadvantage - goes wrong if train_dict contains any None value
    value = train_dict.get(test[sample])
    return performance[sample] if value is None else value

def results(sample):
    # disadvantage - exception might be slow, and might be the common case
    try:
        return train_dict[test[sample]]
    except KeyError:
        return predictions[sample]

default_value = object()
def results(sample):
    # disadvantage - kind of obscure
    value = train_dict.get(test[sample], default_value)
    return performance[sample] if value is default_value else value

Of course, all of these functions assume that test and predictions will remain unmodified for as long as you use the results function.

1 Comment

Thank you for the answer, I'll have a look into this.
1

not sure if this will give you a performance boost, but I guess you can try it:

def look_up( x ):
    try:
        return train_dict[ test[ x ] ]
    except KeyError:
        return predictions[ x ]

result = map ( look_up, xrange( len( listed_predictions ) ) )

2 Comments

Thanks for the answer, I'll have a look.
Exception handlers are fast to setup but costly to execute, so unless test[x] is expected to exist in train_dict for most values of x this might end up being even slower. Also you have the overhead of the function call AND the overhead of two global name lookups.
1

In Python 2.7, assuming you can form a dictionary of training samples and test samples as:

dict1 = dict(izip(train_samples, labels))
dict2 = dict(izip(test_samples, predictions))

then:

result = dict(dict2.items() + [(k,v) for k,v in dict1.viewitems() if k in dict2])

Gives you the dictionary that always uses the known labels from the training set but is limited in range to only samples from the test set. You can get this back into a list if need be.

There may be faster implementations using Series from pandas or numpy with where and unique.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.