Python function performance

Question

I have 130 lines of code. Everything except lines 79 to 89 (the total_user_and_rate function) works fine and runs in ~0.16 seconds. However, after adding that 10-line function, the program takes 70-75 seconds. The function reads a data file (u.data) containing 100,000 lines of numerical data in this format:

196   242  3  881250949

Each line holds 4 numbers. When I ran the function in a separate Python file while testing (before adding it to the main program), it took 0.15 seconds. However, when I added it to the main program (same code), the whole program takes almost 70 seconds.

Here is my code:

""" Assignment 5: Movie Reviews
    Date: 30.12.2016
"""
import os.path
import time
start_time = time.time()

""" FUNCTIONS """


# Getting film names in film folder
def get_film_name():
    name = ''
    for word in read_data.split(' '):
        if ('(' in word) == False:
            name += word + ' '
        else:
            break
    return name.strip(' ')


# Function for removing date for comparison
def throw_date(string):
    a_list = string.split()[:-1]
    new_string = ''
    for i in a_list:
        new_string += i + ' '
    return new_string.strip(' ')


def film_genre(film_name):
    oboist = []
    genr_list = ['unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama',
                 'Fantasy',
                 'Movie-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
    for item in u_item_list:
        if throw_date(str(item[1])) == film_name:
            for i in range(4, len(item)):
                oboist.append(item[i])
    dictionary = dict(zip(genr_list, oboist))
    genres = ''
    for key, value in dictionary.items():
        if value == '1':
            genres += key + ' '
    return genres.strip(' ')


def film_link(film_name):
    link = ''
    for item in u_item_list:
        if throw_date(str(item[1])) == film_name:
            link += item[3]
    return link


def film_review(film_name):
    review = ''
    for r, d, filess in os.walk('film'):
        for fs in filess:
            fullpat = os.path.join(r, fs)
            with open(fullpat, 'r') as a_file:
                data = a_file.read()
                if str(film_name).lower() in str(data.split('\n', 1)[0]).lower():
                    for i, line in enumerate(data):
                        if i > 1:
                            review += line
            a_file.close()
    return review


def film_id(film_name):
    for film in u_item_list:
        if throw_date(film[1]) == film_name:
            return film[0]


def total_user_and_rate(film_name):
    rate = 0
    user = 0
    with open('u.data', 'r') as data_file:
        rate_data = data_file.read()
        for l in rate_data.split('\n'):
            if l.split('\t')[1] == film_id(film_name):
                user += 1
                rate += int(l.split('\t')[2])
    data_file.close()
    print('Total User:' + str(int(user)) + '\nTotal Rate: ' + str(rate / user))



""" MAIN CODE"""
review_file = open("review.txt", 'w')
film_name_list = []
# Look for txt files and extract the film names
for root, dirs, files in os.walk('film'):
    for f in files:
        fullpath = os.path.join(root, f)
        with open(fullpath, 'r') as file:
            read_data = file.read()
            film_name_list.append(get_film_name())
        file.close()

with open('u.item', 'r') as item_file:
    item_data = item_file.read()
item_file.close()

u_item_list = []
for line in item_data.split('\n'):
    temp = [word for word in line.split('|')]
    u_item_list.append(temp)


film_name_list = [i.lower() for i in film_name_list]
updated_film_list = []
print(u_item_list)

# Operation for review.txt
for film_data_list in u_item_list:
    if throw_date(str(film_data_list[1]).lower()) in film_name_list:
        strin = film_data_list[0] + " " + film_data_list[1] + " is found in the folder" + '\n'
        print(film_data_list[0] + " " + film_data_list[1] + " is found in the folder")
        updated_film_list.append(throw_date(str(film_data_list[1])))
        review_file.write(strin)
    else:
        strin = film_data_list[0] + " " + film_data_list[1] + " is not found in the folder. Look at " + film_data_list[
            3] + '\n'
        print(film_data_list[0] + " " + film_data_list[1] + " is not found in the folder. Look at " + film_data_list[3])
        review_file.write(strin)

total_user_and_rate('Titanic')

print("time elapsed: {:.2f}s".format(time.time() - start_time))

My question is: what could be the reason for this? Is the function

("total_user_and_rate(film_name)")

problematic? Or could there be problems in other parts of the program? Or is this normal given the size of the file?

I'm not sure I understand what you're asking. Are you asking why a piece of code runs fast with a small test file, but "slowly" with a 100,000-line production file?
– Disillusioned, Commented Jan 2, 2017 at 4:13
No, I'm asking about this: when I test that function with the same file (u.data) in another .py file, its runtime is no more than 2 seconds. However, when I insert it into my main program and use it on the same file (u.data), the program, which previously ran in 0.1-2 seconds, takes 70 seconds. Is the function the problem, or the program?
– Habil Ganbarli, Commented Jan 2, 2017 at 4:19
Does each column always have the same number of digits? Can you include a few more lines of the data?
– wwii, Commented Jan 2, 2017 at 4:41
Do you have any restrictions on what you can use for the assignment?
– wwii, Commented Jan 2, 2017 at 5:24
No, we don't have any restrictions. My only problem was the runtime, and thanks to you I solved it.
– Habil Ganbarli, Commented Jan 2, 2017 at 5:28

wwii · Accepted Answer · 2017-01-02 04:30:07Z

2

I see a few unnecessary things.

You call film_id(film_name) inside the loop for every line of the file; you only need to call it once, before the loop.

You don't need to read the whole file and then split it to iterate over it; just iterate over the lines of the file.

You split each line twice; do it once.

Refactored for these changes:

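def total_user_and_rate(film_name):
    rate = 0
    user = 0
    # Look up the film id once, not once per line of u.data.
    f_id = film_id(film_name)
    with open('u.data', 'r') as data_file:
        # Iterate over the file line by line instead of reading and splitting it.
        for line in data_file:
            fields = line.split('\t')  # split each line only once
            if fields[1] == f_id:
                user += 1
                rate += int(fields[2])
    # The with block closes the file, so no explicit close() is needed.
    if user == 0:
        print('No ratings found for ' + film_name)
        return
    print('Total User:' + str(user) + '\nTotal Rate: ' + str(rate / user))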

answered Jan 2, 2017 at 4:30

wwii


Community · Accepted Answer · 2017-05-23 12:01:49Z

1

In your test you were probably testing with a much smaller u.item file, or doing something else that made film_id much quicker. (By quicker, I mean it probably returned almost immediately, on the order of a microsecond or less.)

The problem is that computers are so fast that you didn't notice you had made a big mistake: doing something that runs "slowly" in computer time.

If your if l.split('\t')[1] == film_id(film_name): line takes 1 millisecond, then when processing a 100,000 line u.data file, you could expect your total_user_and_rate function to take 100 seconds.
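That estimate is simply the per-line cost multiplied by the number of lines. A back-of-the-envelope sketch, where the 1 ms per-line cost is the hypothetical figure from the paragraph above and not a measurement:

```python
# Assumed cost of one pass through the `if` line, including the film_id call.
cost_per_line_s = 0.001    # 1 millisecond, a hypothetical figure
lines_in_u_data = 100_000  # size of u.data as stated in the question

estimated_total_s = cost_per_line_s * lines_in_u_data
print("estimated runtime: {:.0f} s".format(estimated_total_s))

# Working backwards from the roughly 70 s reported in the question:
observed_total_s = 70
per_line_ms = observed_total_s / lines_in_u_data * 1000
print("implied cost per line: {:.1f} ms".format(per_line_ms))
```

Read the other way round, the 70 seconds reported in the question implies roughly 0.7 ms per line, which is the same order of magnitude as the assumed figure.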

The problem is that film_id iterates over all your films to find the correct id, and it does so for every single line in u.data. If the film you're looking for is near the beginning of u_item_list, you're lucky: the function returns almost immediately. But as soon as you run your new function for a film near the end of u_item_list, you'll notice performance problems.

wwii has explained how to optimise the total_user_and_rate function. But you could also gain performance by changing u_item_list to a dictionary keyed by film name. This would reduce lookups such as film_id from O(n) to O(1) on average, i.e. the lookup time would stay roughly constant no matter how many films are included.
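The dictionary idea could look something like the sketch below. It assumes rows shaped like the question's u_item_list (id at index 0, title with a trailing year at index 1). The helper name build_film_index and the sample rows are made up for illustration, and throw_date is re-implemented here only so the snippet runs on its own.

```python
def throw_date(title):
    """Drop the trailing '(year)' token, as the question's throw_date does."""
    return ' '.join(title.split()[:-1])


def build_film_index(u_item_list):
    """Map each date-stripped title to its film id in a single pass."""
    index = {}
    for item in u_item_list:
        if len(item) < 2:
            continue  # skip blank or malformed rows
        # setdefault keeps the first id seen for a title, like the linear scan.
        index.setdefault(throw_date(item[1]), item[0])
    return index


# Made-up rows in the '|'-split layout of u.item; ids and URLs are illustrative.
sample_items = [
    ['1', 'Toy Story (1995)', '01-Jan-1995', 'http://example.com/1'],
    ['2', 'GoldenEye (1995)', '01-Jan-1995', 'http://example.com/2'],
    ['3', 'Titanic (1997)', '01-Jan-1997', 'http://example.com/3'],
    [''],  # what a trailing empty line in the file turns into
]

film_ids = build_film_index(sample_items)
print(film_ids.get('Titanic'))  # prints 3
```

The index is built once after u.item is loaded; each later lookup is then a single dictionary access instead of a scan over every film.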

edited May 23, 2017 at 12:01

CommunityBot


answered Jan 2, 2017 at 4:55

Disillusioned


2 Comments

Habil Ganbarli Over a year ago

Thanks for your advice. I really appreciate it.

wwii Over a year ago

The longer I looked at all that was going on, the more I was convinced that preprocessing all the files once into dictionaries, or even a database, would be worthwhile. But then again, it is a homework assignment.
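For the ratings file, preprocessing once into a dictionary might look like the sketch below. It assumes the tab-separated layout the question's code relies on, with the film id in the second column and the rating in the third. The function name load_rating_totals is hypothetical, and only the first sample line comes from the question; the others are made up.

```python
from collections import defaultdict


def load_rating_totals(lines):
    """Return {film_id: [rating_count, rating_sum]} from tab-separated rows."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        entry = totals[fields[1]]
        entry[0] += 1               # one more user rated this film
        entry[1] += int(fields[2])  # add the rating to the running sum
    return totals


# The first line is the sample from the question; the rest are invented.
sample_lines = [
    '196\t242\t3\t881250949\n',
    '186\t242\t5\t881250950\n',
    '22\t377\t1\t881250951\n',
    '\n',
]

totals = load_rating_totals(sample_lines)
count, rating_sum = totals['242']
print(count, rating_sum / count)  # prints 2 4.0
```

Because an open file iterates line by line, the same function could be called as load_rating_totals(data_file) inside a with open('u.data') block. After that single pass, the totals for any film are one dictionary lookup away. Use totals.get(film_id) when the id may be absent, since indexing a defaultdict creates a new entry.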
