
I have a large JSON file, ~5 GB, but instead of being one JSON object it is several concatenated together.

{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820833807970304,"id_str":"422820833807970304"}
{"created_at":"Mon Jan 13 20:01:57 +0000 2014","id":422820837545500672,"id_str":"422820837545500672"}.....

With no newline between the back-to-back curly brackets }{.

I tried replacing the curly brackets with a newline using sed then reading the file with:

import json

data = []
with open(filename, 'r') as f:
    for line in f:  # readline() returns a single line; iterate the file instead
        data.append(json.loads(line))

But this doesn't work.

How can I read this file relatively quickly?

Any help greatly appreciated

  • When you try to use data.append(json.loads(line)), you are loading the entire 5 GB of data into RAM. Commented Apr 6, 2014 at 22:32
  • OK. But even when I split the file into smaller files (50mb), I cannot read the separate JSON files. Commented Apr 6, 2014 at 22:54
  • This looks like a Mongo database dump. Maybe put it back into a Mongo database and use the python interface to that? Commented Apr 6, 2014 at 22:55
  • Basically I have a lot of JSON files (where each JSON represents a document) together in one file. How do I split them up and parse them? Commented Apr 6, 2014 at 22:56
  • What if you split the huge line into many lines with sed 's/}{/}\n{/g'? Commented Apr 6, 2014 at 22:59
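The sed preprocessing suggested in the last comment can also be done in a streaming fashion in pure Python, so the 5 GB file never has to be rewritten on disk. A minimal sketch (Python 3; `iter_json_objects` is a hypothetical helper, and it assumes the literal text `}{` never appears inside a string value, which holds for flat tweet records like the ones above):

```python
import io
import json

def iter_json_objects(f, chunk_size=1 << 16):
    """Yield each JSON object from a file of concatenated objects, one at a time.

    Inserts a newline at every '}{' boundary as it reads, so only one chunk
    (plus any incomplete tail) is ever held in memory. Assumption: '}{' never
    occurs inside a string value.
    """
    buf = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # A '}{' boundary may straddle two chunks, so rescan buf + chunk together.
        buf = (buf + chunk).replace("}{", "}\n{")
        *complete, buf = buf.split("\n")
        for piece in complete:
            if piece.strip():
                yield json.loads(piece)
    if buf.strip():
        yield json.loads(buf)  # the last object in the file

# quick demo on an in-memory file, with a tiny chunk size to exercise boundaries
f = io.StringIO('{"id": 1}{"id": 2}{"id": 3}')
print(list(iter_json_objects(f, chunk_size=4)))  # → [{'id': 1}, {'id': 2}, {'id': 3}]
```

Because the result is a generator, you can process each object and discard it instead of accumulating all of them in a list, which avoids the RAM problem mentioned in the first comment.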

1 Answer


This is a hack. It does not load the whole file into memory. I really hope you use Python 3.

Usage of the helper module DecodeLargeJSON.py (which defines FileString and BigJSONDecoder):

from DecodeLargeJSON import *
import io
import json

# create a file with two jsons
f = io.StringIO()
json.dump({1:[]}, f)
json.dump({2:"hallo"}, f)
print(repr(f.getvalue()))
f.seek(0) 

# decode the file f. f could be any file from here on. f.read(...) should return str
o1, idx1 = json.loads(FileString(f), cls = BigJSONDecoder)
print(o1) # this is the loaded object
# idx1 is the index that the second object begins with
o2, idx2 = json.loads(FileString(f, idx1), cls = BigJSONDecoder)
print(o2)

If you run into objects that cannot be decoded, tell me and we can find a solution.

Disclaimer: this is not the ideal solution. It is a hack that shows how it can be made possible.

Discussion: because it does not load the whole file into memory, regular expressions cannot be used. It also falls back to the pure-Python JSON implementation rather than the C implementation, which could make it slower. I really hate how hard this easy task is. Hopefully somebody else points out a different solution.
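For completeness, the standard library can already decode concatenated JSON from a string via json.JSONDecoder.raw_decode, which returns both the decoded object and the index where it ended — the same idea the idx1/idx2 values above expose. A minimal sketch (it holds the text in memory, so for a 5 GB file you would apply it to manageable slices rather than the whole file at once):

```python
import json

def iter_raw_decode(text):
    """Yield each JSON object from a string of concatenated JSON objects."""
    decoder = json.JSONDecoder()
    idx, end = 0, len(text)
    while idx < end:
        # raw_decode raises on leading whitespace, so skip it first
        while idx < end and text[idx].isspace():
            idx += 1
        if idx >= end:
            break
        obj, idx = decoder.raw_decode(text, idx)
        yield obj

print(list(iter_raw_decode('{"a": 1} {"b": [2, 3]}{"c": "x"}')))
# → [{'a': 1}, {'b': [2, 3]}, {'c': 'x'}]
```

Unlike the `}{`-splitting trick, raw_decode uses a real parser, so it handles objects that contain `}{` inside string values.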


1 Comment

Thanks, trying it out now. I use Python 2.7.
