
I have a large JSON file, ~5 GB, but instead of being one JSON object it is several concatenated together.

{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820833807970304,"id_str":"422820833807970304"}
{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820837545500672,"id_str":"422820837545500672"}.....

With no newline between the back-to-back curly brackets }{.

I tried replacing the curly brackets with a newline using sed then reading the file with:

import json

data = []
with open(filename, 'r') as f:
    for line in f:  # readline() returns a single line; iterate the file instead
        data.append(json.loads(line))

But this doesn't work.

How can I read this file relatively quickly?

Any help greatly appreciated

  • When you try to use data.append(json.loads(line)), you are loading the entire 5 GB of data into RAM. Commented Apr 6, 2014 at 22:32
  • OK. But even when I split the file into smaller files (50mb), I cannot read the separate JSON files. Commented Apr 6, 2014 at 22:54
  • This looks like a Mongo database dump. Maybe put it back into a Mongo database and use the python interface to that? Commented Apr 6, 2014 at 22:55
  • Basically I have a lot of JSON files (where each JSON represents a document) together in one file. How do I split them up and parse them? Commented Apr 6, 2014 at 22:56
  • What if you split the huge line into many lines with sed 's/}{/}\n{/g'? Commented Apr 6, 2014 at 22:59
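The sed preprocessing suggested in the last comment can also be done in a streaming fashion in pure Python, so the 5 GB file never has to be rewritten on disk. A minimal sketch (Python 3; `iter_json_objects` is a hypothetical helper, and it assumes the literal text `}{` never appears inside a string value, which holds for flat tweet records like the ones above):

```python
import io
import json

def iter_json_objects(f, chunk_size=1 << 16):
    """Yield each JSON object from a file of concatenated objects, one at a time.

    Inserts a newline at every '}{' boundary as it reads, so only one chunk
    (plus any incomplete tail) is ever held in memory. Assumption: '}{' never
    occurs inside a string value.
    """
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # A '}{' boundary may straddle two chunks, so rescan buf + chunk together.
        buf = (buf + chunk).replace("}{", "}\n{")
        *complete, buf = buf.split("\n")
        for piece in complete:
            if piece.strip():
                yield json.loads(piece)
    if buf.strip():
        yield json.loads(buf)  # the last object in the file

# quick demo on an in-memory file, with a tiny chunk size to exercise boundaries
f = io.StringIO('{"id": 1}{"id": 2}{"id": 3}')
print(list(iter_json_objects(f, chunk_size=4)))  # → [{'id': 1}, {'id': 2}, {'id': 3}]
```

Because the result is a generator, you can process each object and discard it instead of accumulating all of them in a list, which avoids the RAM problem mentioned in the first comment.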

1 Answer


This is a hack. It does not load the whole file into memory. I really hope you use Python 3.

Usage of the helper module DecodeLargeJSON.py (which defines FileString and BigJSONDecoder):

from DecodeLargeJSON import *
import io
import json

# create a file with two jsons
f = io.StringIO()
json.dump({1:[]}, f)
json.dump({2:"hallo"}, f)
print(repr(f.getvalue()))
f.seek(0) 

# decode the file f. f could be any file from here on. f.read(...) should return str
o1, idx1 = json.loads(FileString(f), cls = BigJSONDecoder)
print(o1) # this is the loaded object
# idx1 is the index that the second object begins with
o2, idx2 = json.loads(FileString(f, idx1), cls = BigJSONDecoder)
print(o2)

If you run into objects that cannot be decoded, tell me and we can find a solution.

Disclaimer: this is not the ideal solution. It is a hack that shows how it can be made possible.

Discussion: because it does not load the whole file into memory, regular expressions cannot be used. It also falls back to the pure-Python JSON implementation rather than the C implementation, which could make it slower. I really hate how hard this easy task is. Hopefully somebody else points out a different solution.
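For completeness, the standard library can already decode concatenated JSON from a string via json.JSONDecoder.raw_decode, which returns both the decoded object and the index where it ended — the same idea the idx1/idx2 values above expose. A minimal sketch (it holds the text in memory, so for a 5 GB file you would apply it to manageable slices rather than the whole file at once):

```python
import json

def iter_raw_decode(text):
    """Yield each JSON object from a string of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        # raw_decode raises on leading whitespace, so skip it first
        while idx < end and text[idx].isspace():
            idx += 1
        if idx >= end:
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

print(list(iter_raw_decode('{"a": 1} {"b": [2, 3]}{"c": "x"}')))
# → [{'a': 1}, {'b': [2, 3]}, {'c': 'x'}]
```

Unlike the `}{`-splitting trick, raw_decode uses a real parser, so it handles objects that contain `}{` inside string values.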


1 Comment

Thanks, trying it out now. I use Python 2.7.
