2

I have a json sample like this:

{"ratings": [{

    "TERM": "movie1",       
    "Rating": "3.5",
    "source": "15786"

},
{
    "TERM": "movie2",       
    "Rating": "3.5",
    "source": "15786"
},
{
    "TERM": "Movie1",       
    "Rating": "3.0",
    "source": "15781"
}

]}

Now I want to create a new json file out of this and logic to filter is to ignore a json object if TERM is already present once. So, for this sample the output will be

{"ratings": [{

"TERM": "movie1",       
"Rating": "3.5",
"source": "15786"

},
{
"TERM": "movie2",       
"Rating": "3.5",
"source": "15786"
}
]}

As movie1 is is already present in index 0 we want to ignore index 2.

I came up with this below logic which works fine for small samples. I have a sample with json array size of 10 milliion and this below code takes 2+days to complete. I am wondering if there is much more efficient way to do this:

import json
import io
input1 = "movies.json"
res=[]
resTerms = []
with io.open(input1, encoding="utf8") as json_data:
    d = json.load(json_data)
    print(len(d['ratings']))
    for x in d['ratings']:
        if x['TERM'].lower() not in resTerms:
            res.append(x)
            resTerms.append(x['TERM'].lower())


final ={}
final["ratings"] = res
output =  "myFileSelected.json"
with io.open(output, 'w') as outfile:
    json.dump(final, outfile)
2
  • The most time you spend on read\write operation. Try to use codecs.open instead of io library, it works faster. Commented Jun 11, 2018 at 15:29
  • 1
    You could split each item down into a dictionary of {x['TERM'].lower(): []} and then write out only the first entry of each key rather than checking each TERM against an array of entries. Commented Jun 11, 2018 at 15:29

1 Answer 1

3

The problem here is when you check if the term is already present (i.e. when you check if x['TERM'].lower() not in resTerms:). This is because resTerms is a list and the lookup complexity is O(n) therefore the whole algorithm becomes O(n^2)

The way to solve this is to use a set instead of a list which has lookup coplexity if O(1). Then your loop would look something like this (you also do not need to keep the json file open)

import json
import io
input1 = "movies.json"
res=[]
resTerms = set()
with io.open(input1, encoding="utf8") as json_data:
    d = json.load(json_data)
    print(len(d['ratings']))

for x in d['ratings']:
    if x['TERM'].lower() not in resTerms:
        res.append(x)
        resTerms.add(x['TERM'].lower())

A handy guide of Python data structures and time complexities can be found here: https://wiki.python.org/moin/TimeComplexity

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.