
I need help with improving my script's execution time.

It does what it's supposed to do:

  • Reads a file line by line
  • Matches each line against the contents of a JSON file
  • Writes the matching lines, together with the corresponding information from the JSON file, into a new TXT file

The problem is the execution time: the text file has more than 500,000 lines, and the JSON file contains many more entries.

How can I optimize this script?

import json
import time

start = time.time()
print(start)

JsonFile = open('categories.json')
data = json.load(JsonFile)
Annotated_Data = {}
FileList = [line.rstrip('\n') for line in open("FilesNamesID.txt")]
for File in FileList:
    for key, value in data.items():
        if File == key:
            Annotated_Data[key] = value
with open('Annotated_Files.txt', 'w') as outfile:
    json.dump(Annotated_Data, outfile, indent=4)

end = time.time()
print(end - start)
  • You should look into what time complexity and Big O notation are. Commented Jun 25, 2019 at 21:21
  • Instead of FileList = [line.rstrip('\n') for line in open("FilesNamesID.txt")] I would iterate directly with for File in open("FilesNamesID.txt"). This avoids creating a 500,000-line list which has to be stored in memory, so only the current line is loaded at a time. Commented Jun 25, 2019 at 21:43
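The streaming approach from the comment above can be sketched like this (using io.StringIO as a stand-in for the real FilesNamesID.txt, so the snippet runs on its own):

```python
import io

# In-memory stand-in for FilesNamesID.txt; in the real script you would
# use open("FilesNamesID.txt") here instead.
fake_file = io.StringIO("file1\nfile2\nfile3\n")

ids = []
# Iterating the file object directly yields one line at a time, so only
# the current line is held in memory -- no 500,000-entry list up front.
for line in fake_file:
    ids.append(line.rstrip('\n'))

print(ids)  # ['file1', 'file2', 'file3']
```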

2 Answers

2

There is no need for the nested for loop to look up the File in data. You could replace it with the following code:

for File in FileList:
    if File in data:
        Annotated_Data[File] = data[File]

or with a comprehension:

Annotated_Data = {File: data[File] for File in FileList if File in data}

You can also avoid copying the contents of the whole FilesNamesID.txt to the new list - you are consuming it line by line anyway - but it would be a relatively minor improvement.
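Putting the dictionary lookup and the line-by-line streaming together, the whole script could look like the sketch below. The tiny sample inputs at the top are stand-ins so the snippet runs on its own; in the real script categories.json and FilesNamesID.txt already exist on disk.

```python
import json

# Stand-in inputs so the sketch is self-contained; the real files would
# already exist and these four lines would be dropped.
with open('categories.json', 'w') as f:
    json.dump({'file1': 'data1', 'file2': 'data2'}, f)
with open('FilesNamesID.txt', 'w') as f:
    f.write('file1\nfile3\n')

# Load the JSON once, then stream the ID file line by line, doing one
# O(1) dict membership test per line instead of the nested loop.
with open('categories.json') as json_file:
    data = json.load(json_file)

with open('FilesNamesID.txt') as id_file:
    Annotated_Data = {
        file_id: data[file_id]
        for file_id in (line.rstrip('\n') for line in id_file)
        if file_id in data
    }

with open('Annotated_Files.txt', 'w') as outfile:
    json.dump(Annotated_Data, outfile, indent=4)

print(Annotated_Data)  # {'file1': 'data1'}
```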


3 Comments

It is not minor if the file has over 500,000 lines
A minor improvement relative to the nested loop that runs for each of those 500,000 lines :)
Thank you so much for your help; your answer is the best in terms of improving the performance. However, it doesn't work if the file ID contains letters, only when the file IDs contain only numbers! Any ideas?
1

I don't know the exact format of your data, but you could try to speed up your script by using set():

json_data = '''
    {
        "file1": "data1",
        "file2": "data2",
        "file3": "data3"
    }
'''


filenames_id_txt = '''
    file1
    file3
'''

import json

data = json.loads(json_data)
lines = [l.strip() for l in filenames_id_txt.splitlines() if l.strip()]

s = set(data.keys())

Annotated_Data = {k: data[k] for k in s.intersection(lines)}

print(json.dumps(Annotated_Data))

Prints:

{"file3": "data3", "file1": "data1"}

EDIT: If I understand your question correctly, you want to find "intersection" between your JSON data and lines in your TXT file.

I chose a set() (doc) to store the JSON keys (a set is a collection of unique elements). Sets have very fast methods; one of them is intersection() (doc), which accepts any iterable (e.g. the lines from the TXT file) and returns a new set of the common elements.

I use this new set to construct a new dictionary and output it as a JSON file.
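The same idea can be written slightly more directly: in Python 3, dict.keys() already returns a set-like view, so the explicit set(data.keys()) copy in the answer above is optional and the intersection can be taken on the keys view itself:

```python
# dict.keys() is a set-like view in Python 3, so it supports set
# operations such as & (intersection) with any iterable directly --
# no separate set(data.keys()) copy is needed.
data = {"file1": "data1", "file2": "data2", "file3": "data3"}
lines = ["file1", "file3"]

common = data.keys() & lines   # set intersection on the keys view
print(sorted(common))  # ['file1', 'file3']
```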

2 Comments

This is probably going to be difficult for a beginner to understand. But it is probably the fastest thing to do for this problem in Python. Maybe it's worth not just saying "use set()" but rather explain that Python has a set type, and that set operations such as intersection are pretty fast. That way if they know a bit of math, even if they don't know much Python, they can sort of figure out what your code is doing.
@JohnY You're right; I added some explanation at the end of my answer.
