3

I'm trying to parse a really large JSON file in Python. The file has 6523440 lines but is broken into a lot of JSON objects.

The structure looks like this:

[
  {
    "projects": [
     ...
    ]
  }
]
[
  {
    "projects": [
     ...
    ]
  }
]
....
....
....

and it goes on and on...

Every time I try to load it using json.load(), I get this error:

ValueError: Extra data: line 2247 column 1 - line 6523440 column 1 (char 101207 - 295464118)

The error points at the line where the first object ends and the second one starts. Is there a way to load them separately, or anything similar?

6
  • 1
    I think you would have to parse the file yourself and split it into separate objects before passing it to json.load - it doesn't handle reading a bit and passing it back like e.g. pickle, AFAIK. Commented Oct 29, 2015 at 14:01
  • That structure suggests multiple arrays of one object Commented Oct 29, 2015 at 14:04
  • It is. 2900 of them to be precise Commented Oct 29, 2015 at 14:17
  • 1
    It sounds like your file is missing a comma at the end of the previous line (or something similar). Commented Oct 29, 2015 at 14:28
  • 3
    It's erroring because that's not valid JSON. You need to separate the elements with commas in the right places. Commented Oct 29, 2015 at 14:36
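
To make the comments above concrete, here is a minimal reproduction of the error on a tiny input. It is only a sketch and assumes Python 3, where the exception is json.JSONDecodeError (a subclass of ValueError); the sample string is made up to mimic the file's layout. The `pos` attribute of the exception marks where the parser found content after the first complete document.

```python
import json

# Two complete JSON arrays back to back, mimicking the file's layout
# (made-up sample data, not the asker's real file).
text = '[{"projects": [1]}]\n[{"projects": [2]}]\n'

try:
    json.loads(text)
    error = None
except json.JSONDecodeError as err:
    # Keep a reference; "err" is unbound once the except block ends.
    error = err

# msg is the short reason; pos/lineno/colno locate the unexpected content.
print(error.msg, error.pos, error.lineno, error.colno)

# Everything before error.pos is a valid document on its own.
first = json.loads(text[:error.pos])
print(first)
```

So the parser is not choking on bad syntax inside an array. It finishes the first array and then refuses the second one, which is why the answers below decode one document at a time.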

2 Answers

2

You can try using a streaming json library like ijson:

Sometimes when dealing with a particularly large JSON payload it may be worth not even constructing individual Python objects, and instead reacting to individual events immediately, producing some result

0

Try using json.JSONDecoder.raw_decode. It still requires you to have the entire document in memory, but allows you to iteratively decode many objects from one string.

import re
import json

document = """
[
    1,
    2,
    3
]
{
    "a": 1,
    "b": 2,
    "c": 3
}
"""

not_whitespace = re.compile(r"\S")

decoder = json.JSONDecoder()

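items = []
index = 0
while True:
    # raw_decode() rejects leading whitespace, so skip to the next value.
    match = not_whitespace.search(document, index)
    if not match:
        break  # only whitespace remains

    # Returns the decoded value and the index just past its end.
    item, index = decoder.raw_decode(document, match.start())
    items.append(item)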

print(items)
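
If holding the whole file in memory is a problem, the same raw_decode idea can be applied to a buffer that is refilled in chunks. The following is a rough sketch, not part of the original answer: the function name `iter_json_documents` and the `chunk_size` default are my own choices. It assumes the stream is opened in text mode and that a failed decode before end of file just means the current value is incomplete.

```python
import io
import json


def iter_json_documents(stream, chunk_size=65536):
    """Yield each top-level JSON value from a text stream of concatenated documents.

    Hypothetical helper: only the current, not-yet-decoded document is buffered.
    """
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        at_eof = not chunk
        buffer += chunk
        while True:
            # raw_decode() rejects leading whitespace.
            buffer = buffer.lstrip()
            if not buffer:
                break
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                if at_eof:
                    raise  # truncated or genuinely malformed input
                break  # assume the value is incomplete; read more data
            if end == len(buffer) and not at_eof:
                # A bare number at the very end could continue in the next chunk.
                break
            yield item
            buffer = buffer[end:]
        if at_eof:
            return


# Small in-memory demonstration with a deliberately tiny chunk size.
sample = io.StringIO('[{"projects": [1, 2]}]\n[{"projects": [3]}]\n')
docs = list(iter_json_documents(sample, chunk_size=8))
print(docs)
```

With a real file you would pass something like `open(path, encoding="utf-8")`, where `path` is your own file name. Two caveats: an incomplete document is re-parsed from its start after every new chunk, so a chunk size much smaller than a typical document wastes time, and a syntax error in the middle of the file is only reported once the rest of the file has been read into the buffer.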
