8

I'm fairly new to programming and Python. I know there are a lot of explanations in previous questions about this, but I carefully read all of them and didn't find the solution.
I'm trying to read a JSON file which contains about 1 billion lines of data like this:

334465|{"color":"33ef","age":"55","gender":"m"}
334477|{"color":"3444","age":"56","gender":"f"}
334477|{"color":"3999","age":"70","gender":"m"}

I have been trying hard to get past the 6-digit number at the beginning of each line, but I don't know how to read multiple JSON objects. Here is my code, but I can't figure out why it isn't working:

import json

T =[]
s = open('simple.json', 'r')
ss = s.read()
for line in ss:
    line = ss[7:]
    T.append(json.loads(line))
s.close()

And here is the error that I got:

ValueError: Extra Data: line 3 column 1 - line 5 column 48 (char 42 - 138)

Any suggestions would be very helpful!

0

4 Answers

9

There are several problems with the logic of your code.

ss = s.read()

reads the entire file s into a single string. The next line

for line in ss:

iterates over each character in that string, one by one. So on each loop line is a single character. In

    line = ss[7:]

you are getting the entire file contents apart from the first 7 characters (in positions 0 through 6, inclusive) and replacing the previous content of line with that. And then

T.append(json.loads(line))

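attempts to parse that as JSON and store the resulting object in the T list. Because the string still contains every remaining record, json.loads finds more text after the first complete JSON object and raises the "Extra data" error.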
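
To make the two problems above concrete, here is a minimal sketch that reproduces them on a small in-memory string instead of the real file. The names contents, first_items and error_message are made up for this illustration, and the sample has only two records.

```python
import json

# Two records in the question's format, held in one string
# the way s.read() would return them.
contents = (
    '334465|{"color":"33ef","age":"55","gender":"m"}\n'
    '334477|{"color":"3444","age":"56","gender":"f"}\n'
)

# Iterating over a string yields single characters, not lines.
first_items = [ch for ch in contents][:3]
print(first_items)

# Slicing the whole string only drops the first 7 characters, so every
# later record is still there and json.loads sees more than one document.
try:
    json.loads(contents[7:])
    error_message = None
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    error_message = str(exc)
print(error_message)
```

The first print shows three one-character strings, and the second shows an "Extra data" message like the one in the question.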


Here's some code that does what you want. We don't need to read the entire file into a string with .read, or into a list of lines with .readlines; we can simply put the file object in a for loop, which iterates over the file line by line. Note that line[7:] assumes every line starts with exactly six digits followed by '|'.

We use a with statement to open the file, so that it will get closed automatically when we exit the with block, or if there's an IO error.

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        table.append(json.loads(line[7:]))

for row in table:
    print(row)

output

{'color': '33ef', 'age': '55', 'gender': 'm'}
{'color': '3444', 'age': '56', 'gender': 'f'}
{'color': '3999', 'age': '70', 'gender': 'm'}

We can make this more compact by building the table list in a list comprehension:

import json

with open('simple.json', 'r') as f:
    table = [json.loads(line[7:]) for line in f]

for row in table:
    print(row)
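
If the file really has about a billion lines, a list holding every parsed row may not fit in memory. One possible alternative, sketched here under the assumption that each record can be processed on its own, is a generator that yields one parsed object at a time. The function name iter_records and the in-memory sample list are hypothetical; with the real file you would pass the open file object instead.

```python
import json

def iter_records(lines):
    """Yield one parsed JSON object per line, skipping the 7-character prefix."""
    for line in lines:
        # Same slicing as above: six digits plus the '|' separator.
        yield json.loads(line[7:])

# Stand-in for the file; any iterable of lines works, including an open file.
sample = [
    '334465|{"color":"33ef","age":"55","gender":"m"}\n',
    '334477|{"color":"3444","age":"56","gender":"f"}\n',
    '334477|{"color":"3999","age":"70","gender":"m"}\n',
]

# Only one record is held at a time while the ages are collected.
ages = [int(record["age"]) for record in iter_records(sample)]
print(ages)
```

With the real data this would be used as for record in iter_records(f): inside the with block, so that nothing beyond the current line is kept unless you store it yourself.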

8

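If you use Pandas and every line of the file is a bare JSON object, you can simply write df = pd.read_json(f, lines=True). Note that the lines in the question start with a number and a '|', so that prefix would have to be stripped before this works.

According to the docs, lines=True means: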

Read the file as a json object per line.


0

You should use readlines() instead of read(), and wrap your JSON parsing in a try/except block so that one malformed line doesn't abort the whole run. The trailing newline on each line is not a problem by itself, since json.loads ignores surrounding whitespace.

import json

s = open('simple.json', 'r')
for line in s.readlines():
    try:
        # Take the text after the last '|' as the JSON payload
        j = line.split('|')[-1]
        json.loads(j)
    except ValueError:
        # You probably have bad JSON
        continue
s.close()

4 Comments

s.readlines() reads the entire text file into memory, which is rather wasteful, especially if 'simple.json' contains about a billion lines, as the OP states. The table of extracted data is going to chew up a lot of RAM so it's a Good Idea for the script to be as frugal as possible.
I just noticed that your code doesn't actually do anything with the object created with json.loads(j), it just throws it away.
I was solving the problem of parsing JSON properly. Thought it was apparent that you'd have to process that later.
Fair enough. And Raffael figured out what you meant. OTOH, SO answers are supposed to be of benefit to future readers too, not just the OP, but I guess if some future reader just blindly copies your code, they deserve all the trouble they get. :) Still, it's nice to avoid doing stuff like that in SO answers unless you clearly explain what you're doing in the accompanying text.
0

Thank you so much! You guys are lifesavers! This is the code that I eventually came up with. I think it is a combination of all the answers!

import json

table = []
with open('simple.json', 'r') as f:
    for line in f:
        try:
            j = line.split('|')[-1]
            table.append(json.loads(j))
        except ValueError:
            # You probably have bad JSON
            continue

for row in table:
    print(row)

2 Comments

1. As I mentioned to Alex, for line in s.readlines(): wastes time and RAM because it first reads the whole file into a list; just use for line in s: to iterate over the lines.
2. .split takes a maxsplit arg, so j = line.split('|', 1)[-1] is safer and more efficient than your version: it stops scanning as soon as it finds the first '|', so you won't lose data if the JSON has a '|' in it.
3. You probably should print or log a warning message in your except block, and get rid of that continue: it does nothing since you're at the end of the loop anyway. If you really don't want to log the error, then replace continue with pass, to make it clearer what your code is doing.
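
The three suggestions above could be combined roughly as follows. This is only a sketch: the function name load_records, the logger setup and the sample lines (one with a '|' inside the JSON, one deliberately invalid) are assumptions made for the illustration.

```python
import json
import logging

logger = logging.getLogger(__name__)

def load_records(lines):
    """Parse 'id|json' lines, logging and skipping any that fail to parse."""
    table = []
    for lineno, line in enumerate(lines, start=1):
        # maxsplit=1 splits only at the first '|', so a '|' inside the
        # JSON text stays part of the payload.
        payload = line.split('|', 1)[-1]
        try:
            table.append(json.loads(payload))
        except ValueError as exc:
            # Report the bad line instead of dropping it silently.
            logger.warning("Skipping line %d: %s", lineno, exc)
    return table

sample = [
    '334465|{"color":"33|ef","age":"55","gender":"m"}\n',
    '334477|not valid json\n',
]
records = load_records(sample)
print(records)
```

With the real file, load_records(f) would be called inside a with open('simple.json', 'r') as f: block, so the lines are still read one at a time.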
