Reading the JSON File with multiple objects in Python

Question

I'm fairly new to programming and Python. I know there are a lot of explanations in previous questions about this, but I read all of them carefully and didn't find a solution.
I'm trying to read a JSON file which contains about 1 billion lines of data like this:

334465|{"color":"33ef","age":"55","gender":"m"}
334477|{"color":"3444","age":"56","gender":"f"}
334477|{"color":"3999","age":"70","gender":"m"}

I have been trying hard to get past the 6-digit number at the beginning of each line, but I don't know how to read multiple JSON objects. Here is my code, but I can't work out why it isn't working:

import json

T =[]
s = open('simple.json', 'r')
ss = s.read()
for line in ss:
    line = ss[7:]
    T.append(json.loads(line))
s.close()

And here is the error that I got:

ValueError: Extra Data: line 3 column 1 - line 5 column 48 (char 42 - 138)

Any suggestion would be very helpful for me!

PM 2Ring · Accepted Answer · 2016-11-21 03:48:43Z

There are several problems with the logic of your code.

ss = s.read()

reads the entire file s into a single string. The next line

for line in ss:

iterates over each character in that string, one by one. So on each loop line is a single character. In

    line = ss[7:]

you are getting the entire file contents apart from the first 7 characters (in positions 0 through 6, inclusive) and replacing the previous content of line with that. And then

T.append(json.loads(line))

attempts to parse that string as JSON and append the resulting object to the T list. Parsing fails with the "Extra data" error because more text follows the first JSON object.
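
To see these steps in isolation, here is a small sketch that uses an in-memory string in place of the file. The sample text is copied from the question; the variable names other than ss are only for illustration. It shows that looping over a string yields single characters, and that json.loads rejects the sliced text because more lines follow the first JSON object.

```python
import json

# Stand-in for s.read(): the three sample lines from the question.
ss = (
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"3444","age":"56","gender":"f"}\n'
    '334477|{"color":"3999","age":"70","gender":"m"}\n'
)

# Iterating over a string yields one character at a time, not one line.
first_items = [ch for ch in ss][:3]
print(first_items)  # ['3', '3', '4']

# ss[7:] drops only the first 7 characters of the whole text, so the
# result still contains all three lines.
rest = ss[7:]
print(rest.count('\n'))  # 3

# json.loads parses the first object, then finds more text after it.
try:
    json.loads(rest)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```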

Here's some code that does what you want. We don't need to read the entire file into a string with .read, or into a list of lines with .readlines; we can simply put the file handle into a for loop, and that will iterate over the file line by line.

We use a with statement to open the file, so that it will get closed automatically when we exit the with block, or if there's an IO error.

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        table.append(json.loads(line[7:]))

for row in table:
    print(row)

output

{'color': '33ef', 'age': '55', 'gender': 'm'}
{'color': '3444', 'age': '56', 'gender': 'f'}
{'color': '3999', 'age': '70', 'gender': 'm'}

We can make this more compact by building the table list in a list comprehension:

import json

with open('simple.json', 'r') as f:
    table = [json.loads(line[7:]) for line in f]

for row in table:
    print(row)

Piotr Czapla · Accepted Answer · 2018-05-08 18:53:27Z

If you use Pandas, you can simply write

df = pd.read_json(f, lines=True)

As per the docs, lines=True means:

Read the file as a json object per line.

Alex · Accepted Answer · 2016-11-21 03:30:50Z

You should use readlines() instead of read(), and wrap your JSON parsing in a try/except block. Your lines probably contain a trailing newline character and that would cause an error.

s = open('simple.json', 'r')
for line in s.readlines():
    try:
        j = line.split('|')[-1]
        json.loads(j)
    except ValueError:
        # You probably have bad JSON
        continue

4 Comments

PM 2Ring Over a year ago

s.readlines() reads the entire text file into memory, which is rather wasteful, especially if 'simple.json' contains about a billion lines, as the OP states. The table of extracted data is going to chew up a lot of RAM so it's a Good Idea for the script to be as frugal as possible.

PM 2Ring Over a year ago

I just noticed that your code doesn't actually do anything with the object created with json.loads(j), it just throws it away.

Alex Over a year ago

I was solving the problem of parsing JSON properly. Thought it was apparent that you'd have to process that later.

PM 2Ring Over a year ago

Fair enough. And Raffael figured out what you meant. OTOH, SO answers are supposed to be of benefit to future readers too, not just the OP, but I guess if some future reader just blindly copies your code, they deserve all the trouble they get. :) Still, it's nice to avoid doing stuff like that in SO answers unless you clearly explain what you're doing in the accompanying text.
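
One way to follow the memory advice in the comments above is to avoid building a list at all and hand out records one at a time. The sketch below is only an illustration: the function name iter_records is made up here, and it assumes every line has the id|json layout shown in the question.

```python
import json
import os
import tempfile


def iter_records(path):
    """Yield one parsed JSON object per line of the file at path.

    Hypothetical helper: assumes every line looks like '<id>|<json>'.
    """
    with open(path, 'r') as f:
        for line in f:  # the file object yields lines lazily
            # Split on the first '|' only, then parse the JSON part.
            yield json.loads(line.split('|', 1)[-1])


# Small demonstration with a temporary file holding sample data.
sample = (
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"3444","age":"56","gender":"f"}\n'
)
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    tmp.write(sample)

# Only one record is held in memory at a time while counting.
female_count = sum(1 for rec in iter_records(tmp.name) if rec['gender'] == 'f')
print(female_count)  # 1
os.remove(tmp.name)
```

Whether this helps depends on what happens to the records afterwards: if every row has to be kept, the table itself will still dominate memory use.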

Raffael Edu · Accepted Answer · 2016-11-21 18:38:25Z

Thank you so much! You guys are lifesavers! This is the code that I eventually came up with. I think it is a combination of all the answers!

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        try:
            j = line.split('|')[-1]
            table.append(json.loads(j))
        except ValueError:
            # You probably have bad JSON
            continue

for row in table:
    print(row)

2 Comments

PM 2Ring Over a year ago

1. As I mentioned to Alex, for line in s.readlines(): wastes time and RAM because it first reads the whole file into a list; just use for line in s: to iterate over the lines. 2. .split takes a maxsplit arg, so j = line.split('|', 1)[-1] is safer and more efficient than your version because it will stop scanning as soon as it finds the first '|' so you won't lose data if the JSON has a '|' in it.

PM 2Ring Over a year ago

3. You probably should print or log a warning message in your except block, and get rid of that continue: it does nothing since you're at the end of the loop anyway. If you really don't want to log the error, then replace continue with pass, to make it clearer what your code is doing.
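
Putting the suggestions from these comments together might look like the sketch below. It is an illustration rather than any poster's code: the function name load_table and the use of the logging module are choices made here, and the sample lines (one with a '|' inside the JSON, one malformed) are invented to show the behaviour.

```python
import io
import json
import logging

logging.basicConfig(level=logging.WARNING)


def load_table(lines):
    """Parse '<id>|<json>' lines, logging and skipping any bad ones.

    Hypothetical helper; `lines` is any iterable of strings, such as an
    open file.
    """
    table = []
    for lineno, line in enumerate(lines, 1):
        # maxsplit=1 stops at the first '|', so a '|' inside the JSON
        # text stays part of the payload.
        payload = line.split('|', 1)[-1]
        try:
            table.append(json.loads(payload))
        except ValueError as exc:
            # Report the problem instead of silently skipping the line.
            logging.warning('line %d skipped: %s', lineno, exc)
    return table


# Invented sample: line 2 has '|' in a value, line 3 is malformed.
sample = io.StringIO(
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"34|44","age":"56","gender":"f"}\n'
    '334478|{"color":"3999","age":\n'
)
table = load_table(sample)
print(len(table))  # 2
```

With a plain line.split('|')[-1], the second sample line would be cut at the '|' inside the color value and fail to parse, so that row would be lost.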
