I am trying to convert a very large JSON file to CSV. The code works well with smaller files but takes a very long time on larger ones: I first tested it on a 91 MB file containing 80,000 entries and it took around 45 minutes, but a bigger file containing 300,000 entries took around 5 hours. Is there some way to do it through multiprocessing? I am a beginner Python programmer, so I have no idea how to use multiprocessing or multithreading in Python. Here is my code:
import json
import time
import pandas as pd

csv_project = pd.DataFrame([], columns=['abstract', 'authors', 'n_citation', "references", "title", "venue", "year", 'id'])

with open('test.json', 'r') as f:
    data = f.readlines()

j = 0
for k, i in enumerate(data):
    if '{' in i and '}' in i:
        j += 1
        dictionary = json.loads(i)
        csv_project = csv_project.append(dictionary, ignore_index=True)
    else:
        pass
    if j == 10000:
        print(str(k) + ' number of entries done')
        csv_project.to_csv('data.csv')
        j = 0
csv_project.to_csv('data.csv')
Any useful help will be appreciated. Edit: here is the sample JSON format.
{"abstract": "AdaBoost algorithm based on Haar-like features can achieves high accuracy (above 95%) in object detection.",
"authors": ["Zheng Xu", "Runbin Shi", "Zhihao Sun", "Yaqi Li", "Yuanjia Zhao", "Chenjian Wu"],
"n_citation": 0,
"references": ["0a11984c-ab6e-4b75-9291-e1b700c98d52", "1f4152a3-481f-4adf-a29a-2193a3d4303c", "3c2ddf0a-237b-4d17-8083-c90df5f3514b", "522ce553-29ea-4e0b-9ad3-0ed4eb9de065", "579e5f24-5b13-4e92-b255-0c46d066e306", "5d0b987d-eed9-42ce-9bf3-734d98824f1b", "80656b4d-b24c-4d92-8753-bdb965bcd50a", "d6e37fb1-5f7e-448e-847b-7d1f1271c574"],
"title": "A Heterogeneous System for Real-Time Detection with AdaBoost",
"venue": "high performance computing and communications",
"year": 2016,
"id": "001eef4f-1d00-4ae6-8b4f-7e66344bbc6e"}
{"abstract": "In this paper, a kind of novel jigsaw EBG structure is designed and applied into conformal antenna array",
"authors": ["Yufei Liang", "Yan Zhang", "Tao Dong", "Shan-wei Lu"],
"n_citation": 0,
"references": [],
"title": "A novel conformal jigsaw EBG structure design",
"venue": "international conference on conceptual structures",
"year": 2016,
"id": "002e0b7e-d62f-4140-b015-1fe29a9acbaa"}
You are appending to a pd.DataFrame in each loop, which takes a lot of processing power. Also, manually parsing a json file (even partially) after import json is like building your own car from scratch instead of driving the one you already bought. Sharing a few lines of your json as a minimal reproducible example will help us bring light to improvements.
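As a sketch of what that comment suggests (assuming the real file has one JSON object per line, which the '{' and '}' check in the loop implies): collect the parsed dictionaries in a plain Python list and build the DataFrame once at the end. csv_project.append copies the entire frame on every call, so the original loop does quadratic work, while appending to a list is cheap.

import json
import pandas as pd

columns = ['abstract', 'authors', 'n_citation', 'references',
           'title', 'venue', 'year', 'id']

rows = []
with open('test.json', 'r') as f:
    for line in f:  # stream the file instead of loading it all with readlines()
        line = line.strip()
        if line.startswith('{') and line.endswith('}'):  # same spirit as the original check
            rows.append(json.loads(line))  # list append does not copy anything

# build the DataFrame once instead of re-copying it on every row
csv_project = pd.DataFrame(rows, columns=columns)
csv_project.to_csv('data.csv', index=False)

Since each line is then a self-contained JSON object (the JSON Lines format), pandas can also read the file directly with pd.read_json('test.json', lines=True), optionally with chunksize=10000 to process it in pieces. Either way the work is linear in the number of entries, so multiprocessing should not be needed here.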