
I'm trying to convert a very large .json file to a .csv file. Here is a sample of the JSON file I have been working with; the files I'll be getting directly from a journal publisher will be in the same format.

The main purpose of this is to extract all the components from the .json file and put the information into our database.

Below is the code I have tried.

import csv
import json
import sys

if len(sys.argv) > 2:  # make sure both file names were passed
  fileInput = sys.argv[1]
  fileOutput = sys.argv[2]
  with open(fileInput, encoding="utf8") as inputFile:  # open the JSON file
    data = json.load(inputFile)  # load the JSON content
  with open(fileOutput, 'w', newline='') as outputFile:  # open the CSV file
    output = csv.writer(outputFile)  # create a csv.writer
    output.writerow(data[0].keys())  # header row
    for row in data:
      output.writerow(row.values())  # values row

I'm getting this error:

json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 542)
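This "Extra data" error means the parser decoded one complete JSON value and then found more input after it. A minimal reproduction of the same failure:

```python
import json

# Two complete JSON documents separated by a newline -- valid JSON Lines,
# but not a single valid JSON document.
payload = '{"a": 1}\n{"b": 2}'

try:
    json.loads(payload)
except json.JSONDecodeError as exc:
    print(exc)  # Extra data: line 2 column 1 (char 9)
```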
  • Are you sure your JSON is valid? Commented Jul 2, 2019 at 4:07
  • @MosheRabaev Actually the publisher sends a .jsonl file, and that file is converted to .json. The converted file is what I'm using for the .csv conversion. Commented Jul 2, 2019 at 4:12
  • @MosheRabaev It isn't valid JSON, and my answer below goes in depth about why it isn't. Commented Jul 2, 2019 at 4:41

2 Answers


That is not valid JSON. The opening bracket at byte offset 0 is closed by a closing bracket at byte offset 383, and then a new opening bracket appears at byte offset 386. That second bracket, outside the value closed at offset 383, is illegal in JSON: the only thing allowed after the closing bracket of the top-level value is whitespace (spaces, tabs, newlines).

It looks a lot like 100 separate JSON documents, one per line. There is no universally safe way to parse that, because a valid JSON document may itself contain newlines. If the data provider can guarantee that their individual documents never contain raw newlines, or that all newlines are escaped (encoded as the two bytes 5C 6E, i.e. \n, rather than a literal 0A byte), then you could of course split the file on newlines. But that approach is unreliable if the provider's documents may contain literal newlines, and the JSON specification allows them, so it would require your provider to stick to a newline-free subset of JSON. If your provider is looking for a quick fix to this issue: use NUL bytes (hex 00) as the separator instead of hex 0A. JSON never contains raw NUL bytes (they must always be escaped as "\u0000"), so you could reliably split the documents on NUL bytes.
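If the provider can guarantee one document per line (i.e. standard JSON Lines), the newline-splitting approach is enough for the CSV conversion the asker wants. A minimal Python sketch, assuming every record has the same keys (the function and file names are illustrative):

```python
import csv
import json

def jsonl_to_csv(input_path, output_path):
    """Convert a JSON Lines file (one JSON object per line) to CSV.

    Assumes every line is a complete JSON object and that all
    objects share the same keys.
    """
    with open(input_path, encoding="utf-8") as infile:
        # Parse each non-blank line as its own JSON document.
        rows = [json.loads(line) for line in infile if line.strip()]
    if not rows:
        return
    with open(output_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```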

Here is what happens when I try to parse all 100 lines as individual JSON documents, splitting them on the 0x0A byte, using this code:

<?php
$jsons=file_get_contents("https://pastebin.com/raw/p9NbH2tG");
json_decode($jsons);
echo json_last_error_msg(),PHP_EOL;
$jsons=explode("\n",$jsons);
foreach($jsons as $json){
        json_decode($json);
        echo json_last_error_msg(),PHP_EOL;
}

output:

$ php foo.php
Syntax error
No error
  (… "No error" repeated 100 times in total, once per line in the file …)

As you can see, each individual line in your file contains valid JSON, but the file as a whole is not valid JSON. Splitting on newlines is NOT a reliable approach in general, though; it just happens to work here because none of the 100 documents in your test file contains a newline.
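For the case where embedded newlines cannot be ruled out, Python's standard library can consume one complete document at a time from a string of concatenated JSON, regardless of how the documents are separated, via json.JSONDecoder.raw_decode. A sketch:

```python
import json

def iter_json_documents(text):
    """Yield each complete JSON document from a string of concatenated
    documents, skipping any whitespace between them."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace separating the documents.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Note that the second document below contains a literal newline, which would break naive line splitting, but parses fine here.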




This looks a lot like the question asked here: Django convert JSON to CSV

Can you share a sample of the JSON response you are getting? Perhaps there is an issue with attempting to decode multiple dictionaries.

3 Comments

Here is the file I'm using for this process.
I agree with hanshenrik that this isn't valid JSON. You can verify this in your code by passing the file's contents to json.loads().
I loaded the file into Python using json.load, and it seems to be complaining about a special character contained within the file. Here's the message I got: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 108147: character maps to <undefined>"
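The UnicodeDecodeError in the last comment is a separate issue from JSON validity: on Windows, open() defaults to a legacy codec (often cp1252, the 'charmap' codec), which has no mapping for byte 0x9d. That byte commonly appears inside UTF-8 sequences such as a right double quotation mark (U+201D, bytes E2 80 9D). Passing an explicit encoding avoids the error. A sketch (the file name is illustrative):

```python
import json

# U+201D encodes to the bytes E2 80 9D in UTF-8; the trailing 0x9D byte
# is exactly what the cp1252 codec cannot decode.
data = '{"title": "A \u201cquoted\u201d title"}'.encode("utf-8")

with open("sample.json", "wb") as f:  # hypothetical file name
    f.write(data)

# An explicit encoding makes the read independent of the OS default.
with open("sample.json", encoding="utf-8") as f:
    obj = json.load(f)
print(obj["title"])
```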
