
I am trying to convert a very large JSON file to CSV. The code works well with smaller files, but it takes very long on larger ones: I first tested it on a 91 MB file containing 80,000 entries and it took around 45 minutes, and a bigger file containing 300,000 entries took around 5 hours. Is there some way to do this with multiprocessing? I am a beginner Python programmer, so I have no idea how to use multiprocessing or multithreading in Python. Here is my code:

import json
import time
import pandas as pd

# empty DataFrame with the expected columns
csv_project = pd.DataFrame([], columns=['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])

# read the whole file into memory as a list of lines
with open('test.json', 'r') as f:
    data = f.readlines()

j = 0
for k, i in enumerate(data):
    # only parse lines that look like complete JSON objects
    if '{' in i and '}' in i:
        j += 1
        dictionary = json.loads(i)
        csv_project = csv_project.append(dictionary, ignore_index=True)
    else:
        pass
    # every 10,000 entries, rewrite everything accumulated so far to data.csv
    if j == 10000:
        print(str(k) + ' number of entries done')
        csv_project.to_csv('data.csv')
        j = 0
csv_project.to_csv('data.csv')

Any help will be appreciated.

Edit: here is a sample of the JSON format:

    {"abstract": "AdaBoost algorithm based on Haar-like features can achieves high accuracy (above 95%) in object detection.", 
"authors": ["Zheng Xu", "Runbin Shi", "Zhihao Sun", "Yaqi Li", "Yuanjia Zhao", "Chenjian Wu"], 
"n_citation": 0,
 "references": ["0a11984c-ab6e-4b75-9291-e1b700c98d52", "1f4152a3-481f-4adf-a29a-2193a3d4303c", "3c2ddf0a-237b-4d17-8083-c90df5f3514b", "522ce553-29ea-4e0b-9ad3-0ed4eb9de065", "579e5f24-5b13-4e92-b255-0c46d066e306", "5d0b987d-eed9-42ce-9bf3-734d98824f1b", "80656b4d-b24c-4d92-8753-bdb965bcd50a", "d6e37fb1-5f7e-448e-847b-7d1f1271c574"],
 "title": "A Heterogeneous System for Real-Time Detection with AdaBoost",
 "venue": "high performance computing and communications",
 "year": 2016,
 "id": "001eef4f-1d00-4ae6-8b4f-7e66344bbc6e"}


{"abstract": "In this paper, a kind of novel jigsaw EBG structure is designed and applied into conformal antenna array",
 "authors": ["Yufei Liang", "Yan Zhang", "Tao Dong", "Shan-wei Lu"], 
"n_citation": 0, 
"references": [], 
"title": "A novel conformal jigsaw EBG structure design", 
"venue": "international conference on conceptual structures", 
"year": 2016, 
"id": "002e0b7e-d62f-4140-b015-1fe29a9acbaa"}
  • At a glance, the biggest problem I see is that you are reassigning a new appended pd.DataFrame on every loop iteration, which takes a lot of processing power (see the sketch after these comments). Also, manually parsing a JSON file (even partially) after import json is like building your own car from scratch instead of driving the one you already bought. Commented Nov 26, 2018 at 18:48
  • Have you profiled your code to see whether the reading of the file or the processing is the bottleneck? Commented Nov 26, 2018 at 18:49
  • @jdrd Reading takes no time; it's the writing that is taking time. Commented Nov 26, 2018 at 18:50
  • @Idlehands As I mentioned, I am a beginner Python programmer and still learning. Can you elaborate a little on how to overcome this? Commented Nov 26, 2018 at 18:51
  • There's no shame in being a beginner, we've all been there. I'm just trying to point out where your bottleneck is. Having said that, if you can share the structure of your JSON as a minimal reproducible example, it'll help us bring light to improvements. Commented Nov 26, 2018 at 18:52
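
To make the first comment concrete: the usual fix is to collect the parsed rows in a plain Python list and build the DataFrame once at the end, because every DataFrame.append call copies the entire frame. A minimal sketch, reusing the filename and column list from the question:

import json
import pandas as pd

rows = []
with open('test.json', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        # only parse lines that look like complete JSON objects
        if line.startswith('{') and line.endswith('}'):
            rows.append(json.loads(line))

# build the DataFrame once and write the csv once
csv_project = pd.DataFrame(rows, columns=['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])
csv_project.to_csv('data.csv', index=False)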

2 Answers


You keep all your data in memory, once as a list of lines and once as a DataFrame. This can slow down your processing considerably.

Using the csv module would allow you to process the file in streaming mode:

import json
import csv

with open('test.json') as lines, open('data.csv', 'w') as output:
    # write each row to the csv as soon as it is parsed, one line at a time
    output = csv.DictWriter(output, ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id'])
    output.writeheader()
    for line in lines:
        line = line.strip()
        if line.startswith('{') and line.endswith('}'):
            output.writerow(json.loads(line))

Comments

  • This is giving me an error: Traceback (most recent call last): File "myscript.py", line 10, in <module> output.writerow(json.loads(line)) File "C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\csv.py", line 155, in writerow return self.writer.writerow(self._dict_to_list(rowdict)) File "C:\Users\HP\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 667: character maps to <undefined>
  • Thanks a lot, you are a life saver. I just added encoding='utf-8' to the file-opening call and it worked, but there is now a blank line after each entry. How do I overcome this?
  • For example, there should be 80,000 lines but it is now showing 160,000 lines; one extra line after each entry.
  • Add lineterminator='\n' as an argument to csv.DictWriter(...).
  • Thanks a lot guys, you both are too good to me today. Having a bad day and then you guys come and save my day.
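
For reference, here is the answer's snippet with the two fixes from the comments folded in: encoding='utf-8' avoids the cp1252 UnicodeEncodeError, and opening the output file with newline='' (the csv module's documented recommendation, equivalent in effect to the lineterminator='\n' suggestion above) avoids the extra blank line after each row:

import json
import csv

COLUMNS = ['abstract', 'authors', 'n_citation', 'references', 'title', 'venue', 'year', 'id']

# utf-8 handles characters such as '\u0144'; newline='' stops the csv module
# from inserting a blank line between rows on Windows
with open('test.json', encoding='utf-8') as lines, \
        open('data.csv', 'w', encoding='utf-8', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, COLUMNS)
    writer.writeheader()
    for line in lines:
        line = line.strip()
        if line.startswith('{') and line.endswith('}'):
            writer.writerow(json.loads(line))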

It seems you're reading a json lines file, which might look something like this:

{"key1": "value1", "key2": ["value2", "value3", "value4"], "key3": "value5"}
{"key1": "value6", "key2": ["value7", "value8"], "key3": "value9"}

Notice there are no commas at the end of each line, and each line is itself a valid JSON object.

Lucky for you, pandas can read the json lines file directly like this:

pd.read_json('test.json', lines=True)

Since your column names are exactly the same as your JSON keys, there's no need to set up a blank DataFrame ahead of time; read_json will do all the parsing for you. Example:

df = pd.read_json('test.json', lines=True)
print(df)

                                            abstract  ...   year
0  AdaBoost algorithm based on Haar-like features...  ...   2016
1  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]

Even luckier, if you are limited by memory, there is a chunksize argument which turns read_json into an iterator over chunks:

json_reader = pd.read_json('test.json', lines=True, chunksize=10000)

Now when you iterate through json_reader, each iteration yields a DataFrame with the next chunksize rows from the json file (the demo below uses a tiny sample file and a small chunk size, so each chunk shows only two rows). Example:

for j in json_reader:
  print(j)

                                            abstract  ...   year
0  AdaBoost algorithm based on Haar-like features...  ...   2016
1  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]
                                            abstract  ...   year
2  AdaBoost algorithm based on Haar-like features...  ...   2016
3  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]
                                            abstract  ...   year
4  AdaBoost algorithm based on Haar-like features...  ...   2016
5  In this paper, a kind of novel jigsaw EBG stru...  ...   2016

[2 rows x 8 columns]

Combining all this newfound knowledge, you can use chunksize=10000 and write each chunked DataFrame to a separate csv like so:

for i, df in enumerate(json_reader):
  df.to_csv('my_csv_file_{}'.format(i))

Here you'll notice I used the enumerate() function to get an auto-incrementing index number, and str.format() to append that index to each generated csv filename.
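
If you would rather end up with a single csv (as in the original code) instead of one file per chunk, a minimal sketch is to append each chunk to the same output file and write the header only with the first chunk; the name data.csv here is just an example:

import pandas as pd

json_reader = pd.read_json('test.json', lines=True, chunksize=10000)

for i, df in enumerate(json_reader):
    # the first chunk creates the file and writes the header; later chunks append without it
    df.to_csv('data.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)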

You can see an example here on Repl.it.

