Efficient way to filter a json array of json objects in python

Question

I have a json sample like this:

{"ratings": [{

    "TERM": "movie1",       
    "Rating": "3.5",
    "source": "15786"

},
{
    "TERM": "movie2",       
    "Rating": "3.5",
    "source": "15786"
},
{
    "TERM": "Movie1",       
    "Rating": "3.0",
    "source": "15781"
}

]}

Now I want to create a new json file out of this and logic to filter is to ignore a json object if TERM is already present once. So, for this sample the output will be

{"ratings": [{

"TERM": "movie1",       
"Rating": "3.5",
"source": "15786"

},
{
"TERM": "movie2",       
"Rating": "3.5",
"source": "15786"
}
]}

As movie1 is is already present in index 0 we want to ignore index 2.

I came up with this below logic which works fine for small samples. I have a sample with json array size of 10 milliion and this below code takes 2+days to complete. I am wondering if there is much more efficient way to do this:

import json
import io
input1 = "movies.json"
res=[]
resTerms = []
with io.open(input1, encoding="utf8") as json_data:
    d = json.load(json_data)
    print(len(d['ratings']))
    for x in d['ratings']:
        if x['TERM'].lower() not in resTerms:
            res.append(x)
            resTerms.append(x['TERM'].lower())


final ={}
final["ratings"] = res
output =  "myFileSelected.json"
with io.open(output, 'w') as outfile:
    json.dump(final, outfile)

The most time you spend on read\write operation. Try to use codecs.open instead of io library, it works faster. — Vladyslav
– Vladyslav, Commented Jun 11, 2018 at 15:29
You could split each item down into a dictionary of {x['TERM'].lower(): []} and then write out only the first entry of each key rather than checking each TERM against an array of entries. — Lane Terry
– Lane Terry, Commented Jun 11, 2018 at 15:29

kosnik · Accepted Answer · 2018-06-11 16:58:48Z

3

The problem here is when you check if the term is already present (i.e. when you check if x['TERM'].lower() not in resTerms:). This is because resTerms is a list and the lookup complexity is O(n) therefore the whole algorithm becomes O(n^2)

The way to solve this is to use a set instead of a list which has lookup coplexity if O(1). Then your loop would look something like this (you also do not need to keep the json file open)

import json
import io
input1 = "movies.json"
res=[]
resTerms = set()
with io.open(input1, encoding="utf8") as json_data:
    d = json.load(json_data)
    print(len(d['ratings']))

for x in d['ratings']:
    if x['TERM'].lower() not in resTerms:
        res.append(x)
        resTerms.add(x['TERM'].lower())

A handy guide of Python data structures and time complexities can be found here: https://wiki.python.org/moin/TimeComplexity

edited Jun 11, 2018 at 16:58

answered Jun 11, 2018 at 15:56

kosnik

2,47413 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Efficient way to filter a json array of json objects in python

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related