
I need help with improving my script's execution time.

It does what it's supposed to do:

  • Reads a file line by line
  • Matches each line against the contents of a JSON file
  • Writes the matching lines, together with the corresponding information from the JSON file, into a new TXT file

The problem is the execution time: the text file has more than 500,000 lines, and the JSON file contains many more entries.

How can I optimize this script?

import json
import time

start = time.time()
print(start)

JsonFile = open('categories.json')
data = json.load(JsonFile)
Annotated_Data = {}
FileList = [line.rstrip('\n') for line in open("FilesNamesID.txt")]
for File in FileList:
    for key, value in data.items():
        if File == key:
            Annotated_Data[key] = value
with open('Annotated_Files.txt', 'w') as outfile:
    json.dump(Annotated_Data, outfile, indent=4)

end = time.time()
print(end - start)
  • You should look into what time complexity and Big O notation are. Commented Jun 25, 2019 at 21:21
  • Instead of FileList = [line.rstrip('\n') for line in open("FilesNamesID.txt")] I would iterate directly with for File in open("FilesNamesID.txt"). This avoids creating a 500,000-line list which has to be stored in memory, so only the current line is loaded at a time. Commented Jun 25, 2019 at 21:43
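The streaming approach from the comment above can be sketched like this (using io.StringIO as a stand-in for the real FilesNamesID.txt, so the snippet runs on its own):

```python
import io

# In-memory stand-in for FilesNamesID.txt; in the real script you would
# use open("FilesNamesID.txt") here instead.
fake_file = io.StringIO("file1\nfile2\nfile3\n")

ids = []
# Iterating the file object directly yields one line at a time, so only
# the current line is held in memory -- no 500,000-entry list up front.
for line in fake_file:
    ids.append(line.rstrip('\n'))

print(ids)  # ['file1', 'file2', 'file3']
```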

2 Answers

2

There is no need for the nested for loop to look up the File in data. You could replace it with the following code:

for File in FileList:
    if File in data:
        Annotated_Data[File] = data[File]

or with a comprehension:

Annotated_Data = {File: data[File] for File in FileList if File in data}

You can also avoid copying the contents of the whole FilesNamesID.txt to the new list - you are consuming it line by line anyway - but it would be a relatively minor improvement.
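Putting the dictionary lookup and the line-by-line streaming together, the whole script could look like the sketch below. The tiny sample inputs at the top are stand-ins so the snippet runs on its own; in the real script categories.json and FilesNamesID.txt already exist on disk.

```python
import json

# Stand-in inputs so the sketch is self-contained; the real files would
# already exist and these four lines would be dropped.
with open('categories.json', 'w') as f:
    json.dump({'file1': 'data1', 'file2': 'data2'}, f)
with open('FilesNamesID.txt', 'w') as f:
    f.write('file1\nfile3\n')

# Load the JSON once, then stream the ID file line by line, doing one
# O(1) dict membership test per line instead of the nested loop.
with open('categories.json') as json_file:
    data = json.load(json_file)

with open('FilesNamesID.txt') as id_file:
    Annotated_Data = {
        file_id: data[file_id]
        for file_id in (line.rstrip('\n') for line in id_file)
        if file_id in data
    }

with open('Annotated_Files.txt', 'w') as outfile:
    json.dump(Annotated_Data, outfile, indent=4)

print(Annotated_Data)  # {'file1': 'data1'}
```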


3 Comments

It is not minor if the file has over 500,000 lines
A minor improvement relative to the nested loop that runs for each of those 500,000 lines :)
Thank you so much for your help; your answer is the best in terms of improving the performance. However, it doesn't work if the file ID contains letters, only when the file IDs contain only numbers! Any ideas?
1

I don't know the exact format of your data, but you could try to speed up your script by using set():

json_data = '''
    {
        "file1": "data1",
        "file2": "data2",
        "file3": "data3"
    }
'''


filenames_id_txt = '''
    file1
    file3
'''

import json

data = json.loads(json_data)
lines = [l.strip() for l in filenames_id_txt.splitlines() if l.strip()]

s = set(data.keys())

Annotated_Data = {k: data[k] for k in s.intersection(lines)}

print(json.dumps(Annotated_Data))

Prints:

{"file3": "data3", "file1": "data1"}

EDIT: If I understand your question correctly, you want to find "intersection" between your JSON data and lines in your TXT file.

I chose a set() (doc) to store the JSON keys (a set is a collection of unique elements). Sets have very fast methods; one of them is intersection() (doc), which accepts any iterable (e.g. the lines from the TXT file) and returns a new set of the common elements.

I use this new set to construct a new dictionary and output it as a JSON file.
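The same idea can be written slightly more directly: in Python 3, dict.keys() already returns a set-like view, so the explicit set(data.keys()) copy in the answer above is optional and the intersection can be taken on the keys view itself:

```python
# dict.keys() is a set-like view in Python 3, so it supports set
# operations such as & (intersection) with any iterable directly --
# no separate set(data.keys()) copy is needed.
data = {"file1": "data1", "file2": "data2", "file3": "data3"}
lines = ["file1", "file3"]

common = data.keys() & lines   # set intersection on the keys view
print(sorted(common))  # ['file1', 'file3']
```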

2 Comments

This is probably going to be difficult for a beginner to understand. But it is probably the fastest thing to do for this problem in Python. Maybe it's worth not just saying "use set()" but rather explain that Python has a set type, and that set operations such as intersection are pretty fast. That way if they know a bit of math, even if they don't know much Python, they can sort of figure out what your code is doing.
@JohnY You're right; I added some explanation at the end of my answer.
