I have a .json file where each line is a separate object. For example, the first two lines are:

{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}

{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}

I have tried processing it with the ijson library as follows:

import ijson

with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)

However, I get this error:

JSONError: Additional data

It seems I'm receiving this error because the file contains multiple objects.

What's the recommended way to analyze such a JSON file in Jupyter?

Thank you in advance.

  • Is your entire file actually valid json? Or is only each line valid json? Commented Aug 8, 2018 at 18:01
  • I can't give an answer without more concrete json, but you could try turning it into a list by separating the json objects with , and wrapping them in [] Commented Aug 8, 2018 at 18:01
  • It seems that each line is valid JSON, and there are millions of lines. Commented Aug 8, 2018 at 19:38

3 Answers

If this is the complete file, it is not a single valid JSON document. To parse as one document, the objects must be separated by commas and the whole file must start and end with a square bracket, like so: [{...},{...}]. For your data it would look like:

[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
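
To illustrate the difference, here is a minimal sketch using Python's standard json module. The two sample records and their review_id values are made up for the example. The line-per-object text is rejected as a single document, while the bracketed form parses. Note that the standard library words the error as "Extra data", whereas ijson reported "Additional data" in the question.

```python
import json

# Hypothetical sample: two objects, one per line (values are made up)
line_delimited_text = '{"review_id": "a1"}\n{"review_id": "b2"}\n'

try:
    json.loads(line_delimited_text)
    parsed_as_one_document = True
except json.JSONDecodeError as exc:
    # The parser finds a second object after the first one ends
    parsed_as_one_document = False
    error_message = exc.msg

# The same objects, comma-separated and wrapped in square brackets
array_text = '[{"review_id": "a1"},\n{"review_id": "b2"}]'
records = json.loads(array_text)

print(parsed_as_one_document, len(records))
```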

Here is some code to convert your file into that format. It streams the input line by line, so the whole file never has to fit in memory:

with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    g.write("[")
    first = True
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines between objects
        if not first:
            g.write(",\n")  # comma goes between objects, never after the last one
        g.write(line)
        first = False
    g.write("]")

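To read a JSON file you could also consider using the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html).

import pandas as pd

# get a pandas DataFrame from a JSON file that holds a single array of objects
df = pd.read_json("path/to/your/filename.json")
# for a file with one object per line, pass lines=True instead:
# df = pd.read_json("path/to/your/filename.json", lines=True)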

If you are not familiar with pandas, here is a quick head start on working with a DataFrame object:

df.head()        # returns the first five rows of the DataFrame
df["review_id"]  # returns the column review_id as a Series
df.iloc[1, :]    # returns the complete row at position 1
df.iloc[1, 2]    # returns the item at row position 1, column position 2

4 Comments

The issue is that I have quite a large JSON file, and it causes a memory error if I attempt pandas read_json. Hence, I am attempting to follow the instructions from dataquest.io/blog/python-json-tutorial. Then again, the file format is incorrect. I should find a way to wrap the objects in square brackets, separated by commas.
I just added some code that you could use to create a new JSON file in the correct format. However, if your file really is too big for pandas, I fear that it will take a while.
And it is faster than I expected.
Was this your accepted answer, or did you still have some issues?

While each line on its own is valid JSON, your file as a whole is not. As such, you can't parse it in one go; you have to iterate over the lines and parse each one into an object.

You can aggregate these objects in one list, and from there do whatever you like with your data:

import json

with open(filename, 'r') as f:
    object_list = []
    for line in f:  # iterate over the file directly, one line at a time
        object_list.append(json.loads(line))
    # object_list will contain all of your file's data

You could write it as a list comprehension to make it a little more Pythonic:

with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f]
    # object_list will contain all of your file's data
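
If the file has millions of lines, holding every object in one list may not fit in memory. A possible variation, sketched here under the assumption that each non-blank line holds one complete object, is to wrap the loop in a generator and aggregate as you go. The helper names iter_json_lines and count_reviews_per_user are hypothetical and chosen only for this example.

```python
import json
from collections import Counter


def iter_json_lines(path):
    """Yield one parsed object per non-blank line of a line-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between objects
                yield json.loads(line)


def count_reviews_per_user(path):
    """Count objects per user_id while keeping only one object in memory at a time."""
    counts = Counter()
    for review in iter_json_lines(path):
        counts[review["user_id"]] += 1
    return counts
```

Calling count_reviews_per_user(filename) then returns a Counter keyed by user_id, without ever building the full list of objects.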

3 Comments

I am not sure it's fair to say it's an invalid JSON file. For example, in my use case I have a network socket instead of a file. Now what, would you say I have an invalid JSON socket? :) This answer works for me though.
JSON only allows one object at the root of the document. You have two. That's objectively invalid JSON. That being said, newline-delimited JSON is a pretty common format. You simply need to read each line as a separate JSON object, that is all. Treating the whole thing as a single object will fail because you have more than one.
The answer you have linked is a more complex solution that is warranted when there is no delimiter between objects. If you have the luxury of having newlines between your objects, you should leverage them.
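
For the case with no delimiter between objects, one possible approach (not necessarily the one in the linked answer) is the standard library's json.JSONDecoder.raw_decode, which parses one value and reports where it ended. This is a minimal sketch that assumes the whole text is already in memory as a string; the function name iter_concatenated is hypothetical.

```python
import json


def iter_concatenated(text):
    """Yield each JSON value from a string of back-to-back JSON values."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so skip it by hand
        while pos < end and text[pos].isspace():
            pos += 1
        if pos == end:
            break
        # Returns the parsed value and the index just past it
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

With newline-delimited input this also works, but parsing line by line is simpler and does not require the whole text in memory.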

Your file contains multiple JSON objects, one per line, which is why parsing it as a single document throws an error. Parse each line separately instead:

import json

with open(filename, 'r') as f:
    lines = f.readlines()
    first = json.loads(lines[0])
    second = json.loads(lines[1])

That parses the first two lines correctly; to load the rest, loop over all the lines in the same way.
