Parse large JSON file in Python

Question

I'm trying to parse a really large JSON file in Python. The file has 6,523,440 lines and is made up of many separate JSON documents, one after another.

The structure looks like this:

[
  {
    "projects": [
     ...
    ]
  }
]
[
  {
    "projects": [
     ...
    ]
  }
]
....
....
....

and it goes on and on...

Every time I try to load it using json.load(), I get this error:

ValueError: Extra data: line 2247 column 1 - line 6523440 column 1 (char 101207 - 295464118)

The error points at the line where the first object ends and the second one starts. Is there a way to load them separately, or anything similar?

I think you would have to parse the file yourself and split it into separate objects before passing it to json.load - it doesn't handle reading a bit and passing it back like e.g. pickle, AFAIK.
– jonrsharpe, Commented Oct 29, 2015 at 14:01
It sounds like your file is missing a comma at the end of the previous line (or something similar).
– Nick Bastin, Commented Oct 29, 2015 at 14:28
It's erroring because that's not valid JSON. You need to separate the elements with commas in the right places.
– Keith, Commented Oct 29, 2015 at 14:36

Dan Cornilescu · Accepted Answer · 2015-10-30 04:08:35Z

2

You can try using a streaming JSON library like ijson:

Sometimes when dealing with a particularly large JSON payload, it may be worth not even constructing individual Python objects and instead reacting to individual events immediately, producing some result.

answered Oct 29, 2015 at 14:28 by shyam
edited Oct 30, 2015 at 4:08 by Dan Cornilescu


Dunes · Accepted Answer · 2015-10-30 07:11:03Z

0

Try using json.JSONDecoder.raw_decode. It still requires you to have the entire document in memory, but allows you to iteratively decode many objects from one string.

import re
import json

document = """
[
    1,
    2,
    3
]
{
    "a": 1,
    "b": 2,
    "c": 3
}
"""

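# Matches the first non-whitespace character, i.e. the start of the next JSON value.
not_whitespace = re.compile(r"\S")

decoder = json.JSONDecoder()

items = []
index = 0
while True:
    # raw_decode does not skip leading whitespace, so find where the next value starts.
    match = not_whitespace.search(document, index)
    if not match:
        break  # only whitespace remains

    # raw_decode returns the decoded value and the index just past its end.
    item, index = decoder.raw_decode(document, match.start())
    items.append(item)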

print(items)
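If holding the whole file in one string is a problem, the same raw_decode idea can be applied to a buffer that is filled in chunks. The following is only a sketch under that assumption: the function name iter_json_documents and the chunk_size default are made up for illustration, and an incomplete document is re-parsed from its start each time more data arrives, so this suits files made of many moderately sized documents rather than one enormous one.

```python
import json


def iter_json_documents(fp, chunk_size=65536):
    """Yield each top-level JSON value from a text stream of concatenated documents."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = fp.read(chunk_size)
        at_eof = not chunk
        buffer += chunk
        while True:
            # raw_decode does not skip leading whitespace, so drop it first.
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if at_eof:
                    raise  # truncated or malformed input
                break  # value is incomplete so far: read another chunk
            if end == len(buffer) and not at_eof:
                # The value touches the end of the buffer, so it might still
                # continue in the next chunk (e.g. a number split in two).
                break
            yield item
            buffer = buffer[end:]
        if at_eof:
            return
```

It could be used as `for doc in iter_json_documents(open(path, encoding="utf-8")): ...`, where path stands for the file in question and each doc is one decoded top-level value.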

answered Oct 30, 2015 at 7:11 by Dunes
