
I am trying to convert a very large JSON file to CSV. The code works well with smaller files, but it takes very long on larger ones: I first tested it on a 91 MB file containing 80,000 entries and it took around 45 minutes, and a bigger file containing 300,000 entries took around 5 hours. Is there some way to do this with multiprocessing? I am a beginner Python programmer, so I have no idea how to use multiprocessing or multithreading in Python. Here is my code:

import json
import time
import pandas as pd

# empty DataFrame with the expected columns
csv_project = pd.DataFrame([], columns=['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])

# read the whole file into memory as a list of lines
with open('test.json', 'r') as f:
    data = f.readlines()

j = 0
for k, i in enumerate(data):
    # only parse lines that look like complete JSON objects
    if '{' in i and '}' in i:
        j += 1
        dictionary = json.loads(i)
        csv_project = csv_project.append(dictionary, ignore_index=True)
    else:
        pass
    # every 10,000 entries, rewrite everything accumulated so far to data.csv
    if j == 10000:
        print(str(k) + ' number of entries done')
        csv_project.to_csv('data.csv')
        j = 0
csv_project.to_csv('data.csv')

Any help will be appreciated.

Edit: here is a sample of the JSON format:

    {"abstract": "AdaBoost algorithm based on Haar-like features can achieves high accuracy (above 95%) in object detection.", 
"authors": ["Zheng Xu", "Runbin Shi", "Zhihao Sun", "Yaqi Li", "Yuanjia Zhao", "Chenjian Wu"], 
"n_citation": 0,
 "references": ["0a11984c-ab6e-4b75-9291-e1b700c98d52", "1f4152a3-481f-4adf-a29a-2193a3d4303c", "3c2ddf0a-237b-4d17-8083-c90df5f3514b", "522ce553-29ea-4e0b-9ad3-0ed4eb9de065", "579e5f24-5b13-4e92-b255-0c46d066e306", "5d0b987d-eed9-42ce-9bf3-734d98824f1b", "80656b4d-b24c-4d92-8753-bdb965bcd50a", "d6e37fb1-5f7e-448e-847b-7d1f1271c574"],
 "title": "A Heterogeneous System for Real-Time Detection with AdaBoost",
 "venue": "high performance computing and communications",
 "year": 2016,
 "id": "001eef4f-1d00-4ae6-8b4f-7e66344bbc6e"}


{"abstract": "In this paper, a kind of novel jigsaw EBG structure is designed and applied into conformal antenna array",
 "authors": ["Yufei Liang", "Yan Zhang", "Tao Dong", "Shan-wei Lu"], 
"n_citation": 0, 
"references": [], 
"title": "A novel conformal jigsaw EBG structure design", 
"venue": "international conference on conceptual structures", 
"year": 2016, 
"id": "002e0b7e-d62f-4140-b015-1fe29a9acbaa"}
  • At a glance, the biggest problem I see is that you are reassigning a new appended pd.DataFrame on every loop iteration, which takes a lot of processing power (see the sketch after these comments). Also, manually parsing a JSON file (even partially) after import json is like building your own car from scratch instead of driving the one you already bought. Commented Nov 26, 2018 at 18:48
  • Have you profiled your code to see whether the reading of the file or the processing is the bottleneck? Commented Nov 26, 2018 at 18:49
  • @jdrd Reading takes no time; it's the writing that is taking time. Commented Nov 26, 2018 at 18:50
  • @Idlehands As I mentioned, I am a beginner Python programmer and still learning. Can you elaborate a little on how to overcome this? Commented Nov 26, 2018 at 18:51
  • There's no shame in being a beginner, we've all been there. I'm just trying to point out where your bottleneck is. Having said that, if you can share the structure of your JSON as a minimal reproducible example, it'll help us bring light to improvements. Commented Nov 26, 2018 at 18:52
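
To make the first comment concrete: the usual fix is to collect the parsed rows in a plain Python list and build the DataFrame once at the end, because every DataFrame.append call copies the entire frame. A minimal sketch, reusing the filename and column list from the question:

import json
import pandas as pd

rows = []
with open('test.json', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        # only parse lines that look like complete JSON objects
        if line.startswith('{') and line.endswith('}'):
            rows.append(json.loads(line))

# build the DataFrame once and write the csv once
csv_project = pd.DataFrame(rows, columns=['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])
csv_project.to_csv('data.csv', index=False)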

2 Answers


You keep all your data in memory, once as a list of lines and once as a DataFrame. This can slow down your processing considerably.

Using the csv module would allow you to process the file in streaming mode:

import json
import csv

with open('test.json') as lines, open('data.csv', 'w') as output:
    # write each row to the csv as soon as it is parsed, one line at a time
    output = csv.DictWriter(output, ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])
    output.writeheader()
    for line in lines:
        line = line.strip()
        if line.startswith('{') and line.endswith('}'):
            output.writerow(json.loads(line))

Comments

  • This is giving me an error: Traceback (most recent call last): File "myscript.py", line 10, in <module> output.writerow(json.loads(line)) File "C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\csv.py", line 155, in writerow return self.writer.writerow(self._dict_to_list(rowdict)) File "C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 667: character maps to <undefined>
  • Thanks a lot, you are a life saver. I just added encoding='utf-8' to the file-opening call and it worked, but there is now a blank line after each entry. How do I overcome this?
  • For example, there should be 80,000 lines but it is now showing 160,000 lines; one extra line after each entry.
  • Add lineterminator='\n' as an argument to csv.DictWriter(...).
  • Thanks a lot guys, you both are too good to me today. Having a bad day and then you guys come and save my day.
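
For reference, here is the answer's snippet with the two fixes from the comments folded in: encoding='utf-8' avoids the cp1252 UnicodeEncodeError, and opening the output file with newline='' (the csv module's documented recommendation, equivalent in effect to the lineterminator='\n' suggestion above) avoids the extra blank line after each row:

import json
import csv

COLUMNS = ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id']

# utf-8 handles characters such as '\u0144'; newline='' stops the csv module
# from inserting a blank line between rows on Windows
with open('test.json', encoding='utf-8') as lines, \
        open('data.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, COLUMNS)
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if line.startswith('{') and line.endswith('}'):
            writer.writerow(json.loads(line))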

It seems you're reading a json lines file, which might look something like this:

{"key1": "value1", "key2": ["value2", "value3", "value4"], "key3": "value5"}
{"key1": "value6", "key2": ["value7", "value8"], "key3": "value9"}

Notice there are no commas at the end of each line, and each line is itself a valid JSON object.

Lucky for you, pandas can read the json lines file directly like this:

pd.read_json('test.json', lines=True)

Since your column names are exactly the same as your JSON keys, there's no need to set up a blank DataFrame ahead of time; read_json will do all the parsing for you. Example:

df = pd.read_json('test.json', lines=True)
print(df)

                                            abstract  ...   year
0  AdaBoost algorithm based on Haar-like features...  ...   2016
1  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]

Even luckier, if you are limited by memory, there is a chunksize argument which turns read_json into an iterator over chunks:

json_reader = pd.read_json('test.json', lines=True, chunksize=10000)

Now when you iterate through json_reader, each iteration yields a DataFrame with the next chunksize rows from the json file (the demo below uses a tiny sample file and a small chunk size, so each chunk shows only two rows). Example:

for j in json_reader:
  print(j)

                                            abstract  ...   year
0  AdaBoost algorithm based on Haar-like features...  ...   2016
1  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]
                                            abstract  ...   year
2  AdaBoost algorithm based on Haar-like features...  ...   2016
3  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]
                                            abstract  ...   year
4  AdaBoost algorithm based on Haar-like features...  ...   2016
5  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]

Combining all this newfound knowledge, you can use chunksize=10000 and write each chunked DataFrame to a separate csv like so:

for i, df in enumerate(json_reader):
  df.to_csv('my_csv_file_{}'.format(i))

Here you'll notice I used the enumerate() function to get an auto-incrementing index number, and str.format() to append that index to each generated csv filename.
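
If you would rather end up with a single csv (as in the original code) instead of one file per chunk, a minimal sketch is to append each chunk to the same output file and write the header only with the first chunk; the name data.csv here is just an example:

import pandas as pd

json_reader = pd.read_json('test.json', lines=True, chunksize=10000)

for i, df in enumerate(json_reader):
    # the first chunk creates the file and writes the header; later chunks append without it
    df.to_csv('data.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)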

You can see an example here on Repl.it.

