9

I have downloaded 5MB of a very large JSON file, and I need to load that 5MB to generate a preview of the file. However, the downloaded part will probably be incomplete. Here's an example of what it may look like:

[{
    "first": "bob",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {
    "first": "sarah",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {"first" : "tom"

From here, I'd like to "rebuild it" so that it can parse the first two objects (and ignore the third).

Is there a json parser that can infer or cut off the end of the string to make it parsable? Or perhaps to 'stream' the parsing of the json array, so that when it fails on the last object, I can exit the loop? If not, how could the above be accomplished?

5
  • That's such a broken API. I don't think it's possible to parse an arbitrary snippet of JSON. It could have any nested structure that just gets cut off. Commented Dec 26, 2018 at 21:40
  • If it's trivial (as in your example), just do it by hand. If it's not, um...err.... Commented Dec 26, 2018 at 21:40
  • This is just a bad recommendation. You can use pandas to read the JSON and then, for the preview, call df.head(2).to_json(orient='records') Commented Dec 26, 2018 at 21:49
  • seems that you don't have a lot of luck with input formats :) Commented Dec 26, 2018 at 22:01
  • What's the reason you want to include the first two "objects" (I suppose you mean the two outermost dicts) but not the third? What makes the json incomplete is a missing }], so by only considering the json format there's no reason to exclude the {"first": "tom"}. The exact criteria for where to crop the string are crucial for developing an algorithm. Commented Dec 26, 2018 at 23:59

2 Answers

8

If your data will always look somewhat similar, you could do something like this:

import json

json_string = """[{
    "first": "bob",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {
    "first": "sarah",
    "address": {
        "street": 13301,
        "zip": 1920
    }
}, {"first" : "tom"
"""

while True:
    if not json_string:
        raise ValueError("Couldn't fix JSON")
    try:
        data = json.loads(json_string + "]")
    except json.decoder.JSONDecodeError:
        json_string = json_string[:-1]
        continue
    break

print(data)

This assumes that the data is a list of dicts. On each pass, a closing ] is appended to the string and the result is handed to the JSON parser. If it parses, the loop breaks. Otherwise the last character of the string is removed and the loop tries again. If no characters are left, ValueError("Couldn't fix JSON") is raised.

For the above example, it prints:

[{'first': 'bob', 'address': {'zip': 1920, 'street': 13301}}, {'first': 'sarah', 'address': {'zip': 1920, 'street': 13301}}]
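
The question also asks about "streaming" the array and stopping at the first element that fails. A minimal sketch of that idea, assuming the input is a top-level array of objects, can use the standard library's json.JSONDecoder.raw_decode, which decodes one value starting at a given index and returns the value together with the index where it ended. The function name parse_complete_items and the sample string are made up for illustration and are not part of the answer above.

```python
import json

def parse_complete_items(text):
    """Return the array elements that decode fully from a truncated JSON array."""
    decoder = json.JSONDecoder()
    items = []
    pos = text.find("[")
    if pos == -1:
        return items
    pos += 1  # step past the opening bracket
    while True:
        # Skip whitespace and the comma separating elements.
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text) or text[pos] == "]":
            break
        try:
            obj, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            break  # the remaining text is an incomplete element
        items.append(obj)
    return items

# Hypothetical truncated input, shortened from the question's example.
truncated = '[{"first": "bob", "address": {"zip": 1920}}, {"first": "sarah"}, {"first" : "tom"'
print(parse_complete_items(truncated))
```

Each complete element is decoded exactly once, so the string is not reparsed after every removed character. One caveat: a bare number cut off at the end of the input (for example 35 truncated to 3) would still decode, so this sketch is best suited to arrays of objects.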

1 Comment

oh I like that simple approach.
1

For the specific structure in the example we can walk through the string and track occurrences of curly brackets and their closing counterparts. If at the end one or more curly brackets remain unmatched, we know that this indicates an incomplete object. We can then strip any intermediate characters such as commas or whitespace and close the resulting string with a square bracket.

This method ensures that the string is only scanned twice, once manually and once by the JSON parser, which might be advantageous for large text files (with incomplete objects consisting of many characters).

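# `string` holds the truncated JSON text (a list of objects).
brackets = []  # indices of '{' that have not been closed yet
for i, c in enumerate(string):
    if c == '{':
        brackets.append(i)
    elif c == '}' and brackets:
        brackets.pop()

if brackets:
    # Cut at the outermost unclosed '{' to drop the incomplete object.
    string = string[:brackets[0]]

# Strip any separator or whitespace left at the end, then close the list.
string = string.rstrip(', \n')
if not string.endswith(']'):
    string += ']'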
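
The scan above counts every curly bracket, including ones that appear inside string values. A sketch of a variant that skips string literals, still assuming a top-level list of objects, could look like the following. The function name trim_to_complete_objects and the sample data are hypothetical.

```python
import json

def trim_to_complete_objects(text):
    """Cut a truncated JSON array of objects back to its last complete object."""
    open_braces = []   # indices of '{' not yet closed
    in_string = False  # True while inside a JSON string literal
    escaped = False    # True right after a backslash inside a string
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            open_braces.append(i)
        elif ch == "}" and open_braces:
            open_braces.pop()
    if open_braces:
        # Drop everything from the outermost unclosed object onwards.
        text = text[:open_braces[0]]
    text = text.rstrip(", \t\r\n")
    if not text.endswith("]"):
        text += "]"
    return text

# Braces inside the string values must not affect the bracket count.
sample = '[{"note": "a } brace"}, {"note": "b { brace"}, {"note": "c'
print(json.loads(trim_to_complete_objects(sample)))
```

The extra state costs two booleans per scan and keeps the single-pass property described above.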

2 Comments

thanks for this answer. As a generalization to the above, how could you 'complete' something more general, such as this ?
@David542 Actually the algorithm is not completing but reducing the json to its valid parts and closing it with the missing brackets. In that sense the specific structure you linked is not more general, it is just a different structure (with different bracketing). Here you can define an offset within the string and then use the same procedure as above: marker = '"data": ['; offset = string.index(marker) + len(marker). You'll need to close the resulting string with ]}]}} instead. If you want to be independent of the structure you'll need to write your own custom json parser.
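
The offset idea from the comment above can be sketched as follows. The structure linked in the question is not shown here, so the sample document, the function name trim_nested_array, and the closing suffix "]}}" are all hypothetical; the suffix has to match whatever brackets enclose the array in the real data. The sketch also assumes the text was cut off somewhere inside the array.

```python
import json

def trim_nested_array(text, marker, closing):
    """Trim a truncated array that starts right after `marker`, then append `closing`."""
    # Only brackets after the marker are tracked, so enclosing objects are ignored.
    offset = text.index(marker) + len(marker)
    open_braces = []  # indices of '{' not yet closed
    for i in range(offset, len(text)):
        ch = text[i]
        if ch == "{":
            open_braces.append(i)
        elif ch == "}" and open_braces:
            open_braces.pop()
    if open_braces:
        # Drop the incomplete trailing object.
        text = text[:open_braces[0]]
    return text.rstrip(", \t\r\n") + closing

# Hypothetical nested document, truncated inside the third array element.
sample = '{"result": {"data": [{"id": 1}, {"id": 2}, {"id"'
fixed = trim_nested_array(sample, '"data": [', "]}}")
print(json.loads(fixed))
```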
