
I have popened a process which is producing a list of dictionaries, something like:

[{'foo': '1'},{'bar':2},...]

The list takes a long time to create and could be many gigabytes, so I don't want to reconstitute it in memory and then iterate over it.

How can I parse the partially completed list such that I can process each dictionary as it is received?

  • It's producing a list of dictionaries, or it's spewing out a long JSON string? Commented Jul 12, 2010 at 23:33
  • a list of dictionaries, but if there's an equivalent technique for processing a json-stream I'd be interested in hearing about that as well. Commented Jul 12, 2010 at 23:37
  • Are they general Python dicts of anything or just strings and ints? Commented Jul 13, 2010 at 1:47
  • Please clarify (1) can you modify the producing process? (2) if not, EXACTLY what bytes are being transmitted ... repr(list_of_dicts) or something else?? Commented Jul 13, 2010 at 2:31

2 Answers


The Python tokenizer is available in the standard library as the tokenize module. It takes as input a readline function (which must supply it one "line" of input per call), so it can operate incrementally. If there are no newlines in your input, you can simulate them, as long as you insert each newline at a spot where it is innocuous -- that is, where it doesn't break up a token. (Thanks to the opening [, everything will be one "logical" line anyway.) The only tokens that require care when inserting newlines are quoted strings; I'm not pursuing that in depth here, since if you actually have newlines in your input you won't need to worry about it.

From the stream of tokens you can reconstruct the string representing each dict in the list (from an opening brace token to the balancing closing brace), and use ast.literal_eval to get the corresponding Python dict.

So, do you have newlines in your input? If so, the whole task should be very easy.
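A minimal sketch of the tokenize + ast.literal_eval approach described above, assuming the input is a Python-literal list of dicts with newlines available (the iter_dicts name is mine, not anything from the question):

```python
import ast
import io
import tokenize

def iter_dicts(stream):
    """Yield each dict from a stream holding a Python-literal list of
    dicts, e.g. "[{'foo': '1'}, {'bar': 2}, ...]", without building the
    whole list in memory."""
    depth = 0    # brace-nesting depth inside the current dict literal
    parts = []   # token strings making up the current dict literal
    for tok in tokenize.generate_tokens(stream.readline):
        tok_str = tok[1]
        if tok_str == '{':
            depth += 1
        if depth > 0:
            parts.append(tok_str)
        if tok_str == '}':
            depth -= 1
            if depth == 0:
                # A complete dict has been collected; rebuild its source
                # text and evaluate it safely.
                yield ast.literal_eval(' '.join(parts))
                parts = []

# Example usage with an in-memory stream standing in for the pipe:
stream = io.StringIO("[{'foo': '1'}, {'bar': 2}]\n")
for d in iter_dicts(stream):
    print(d)
```

In a real program the stream would be the stdout of the popened process; the top-level [ , ] and commas are simply skipped because they occur at depth zero.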




Pickle each dictionary separately. The shelve module can help you do this.

Writer

import shelve

db = shelve.open(filename)
count = 0
for ...whatever...:
    # build the dictionary d
    db[str(count)] = d   # shelve keys must be strings
    count += 1
db['size'] = count
db.close()

Reader

import shelve

db = shelve.open(filename)
size = db['size']
for i in xrange(size):
    d = db[str(i)]   # shelve keys must be strings
    # process the dictionary d
db.close()


