I have a json sample like this:
{"ratings": [{
"TERM": "movie1",
"Rating": "3.5",
"source": "15786"
},
{
"TERM": "movie2",
"Rating": "3.5",
"source": "15786"
},
{
"TERM": "Movie1",
"Rating": "3.0",
"source": "15781"
}
]}
Now I want to create a new JSON file from this, where the filtering logic is: skip a JSON object if its TERM has already appeared (compared case-insensitively). So for this sample the output will be
{"ratings": [{
"TERM": "movie1",
"Rating": "3.5",
"source": "15786"
},
{
"TERM": "movie2",
"Rating": "3.5",
"source": "15786"
}
]}
As movie1 is already present at index 0, we want to ignore index 2 (Movie1 matches movie1 because the comparison is case-insensitive).
I came up with the logic below, which works fine for small samples. However, I have a sample with a JSON array of 10 million entries, and this code takes 2+ days to complete. I am wondering if there is a much more efficient way to do this:
import json
import io

input1 = "movies.json"
res = []
resTerms = []
with io.open(input1, encoding="utf8") as json_data:
    d = json.load(json_data)
    print(len(d['ratings']))
    for x in d['ratings']:
        if x['TERM'].lower() not in resTerms:
            res.append(x)
            resTerms.append(x['TERM'].lower())

final = {}
final["ratings"] = res
output = "myFileSelected.json"
with io.open(output, 'w') as outfile:
    json.dump(final, outfile)
The bottleneck is not the io library (in Python 3, io.open is the same function as the built-in open); it is the `not in resTerms` check. Membership testing on a list is O(n), so the loop is O(n²) overall, which is on the order of 10^13 comparisons for 10 million entries. Keep the seen terms in a set, or in a dict keyed by `x['TERM'].lower()`, and write out only the first entry for each key rather than checking each TERM against a growing list of entries.
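A minimal sketch of the set-based approach, using the sample data from the question (the helper name `dedupe_ratings` is mine, not from the question):

```python
import json

def dedupe_ratings(ratings):
    """Keep only the first entry for each TERM, compared case-insensitively."""
    seen = set()                 # lowered TERMs already kept
    res = []
    for x in ratings:
        key = x["TERM"].lower()
        if key not in seen:      # O(1) set lookup instead of an O(n) list scan
            seen.add(key)
            res.append(x)
    return res

sample = [
    {"TERM": "movie1", "Rating": "3.5", "source": "15786"},
    {"TERM": "movie2", "Rating": "3.5", "source": "15786"},
    {"TERM": "Movie1", "Rating": "3.0", "source": "15781"},
]
print(json.dumps({"ratings": dedupe_ratings(sample)}))
```

The file reading and writing stay exactly as in your script; only the membership structure changes. Since each lookup is now O(1), the whole pass is O(n), and 10 million entries should filter in seconds rather than days.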