
I have popened a process which is producing a list of dictionaries, something like:

[{'foo': '1'},{'bar':2},...]

The list takes a long time to create and could be many gigabytes, so I don't want to reconstitute it in memory and then iterate over it.

How can I parse the partially completed list such that I can process each dictionary as it is received?

  • It's producing a list of dictionaries, or it's spewing out a long JSON string? Commented Jul 12, 2010 at 23:33
  • a list of dictionaries, but if there's an equivalent technique for processing a json-stream I'd be interested in hearing about that as well. Commented Jul 12, 2010 at 23:37
  • Are they general Python dicts of anything or just strings and ints? Commented Jul 13, 2010 at 1:47
  • Please clarify (1) can you modify the producing process? (2) if not, EXACTLY what bytes are being transmitted ... repr(list_of_dicts) or something else?? Commented Jul 13, 2010 at 2:31

2 Answers


The Python tokenizer is available in the standard library as the tokenize module. It takes as input a readline function (which must supply it one "line" of input per call), so it can operate incrementally. If there are no newlines in your input, you can simulate them, as long as you insert each newline at a spot where it is innocuous -- that is, where it doesn't break up a token. (Thanks to the opening [, everything will be one "logical" line anyway.) The only tokens that require care when inserting newlines are quoted strings; I'm not pursuing that in depth here, since if you actually have newlines in your input you won't need to worry about it.

From the stream of tokens you can reconstruct the string representing each dict in the list (from an opening brace token to the balancing closing brace), and use ast.literal_eval to get the corresponding Python dict.

So, do you have newlines in your input? If so, the whole task should be very easy.
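A minimal sketch of the tokenize + ast.literal_eval approach described above, assuming the input is a Python-literal list of dicts with newlines available (the iter_dicts name is mine, not anything from the question):

```python
import ast
import io
import tokenize

def iter_dicts(stream):
    """Yield each dict from a stream holding a Python-literal list of
    dicts, e.g. "[{'foo': '1'}, {'bar': 2}, ...]", without building the
    whole list in memory."""
    depth = 0    # brace-nesting depth inside the current dict literal
    parts = []   # token strings making up the current dict literal
    for tok in tokenize.generate_tokens(stream.readline):
        tok_str = tok[1]
        if tok_str == '{':
            depth += 1
        if depth > 0:
            parts.append(tok_str)
        if tok_str == '}':
            depth -= 1
            if depth == 0:
                # A complete dict has been collected; rebuild its source
                # text and evaluate it safely.
                yield ast.literal_eval(' '.join(parts))
                parts = []

# Example usage with an in-memory stream standing in for the pipe:
stream = io.StringIO("[{'foo': '1'}, {'bar': 2}]\n")
for d in iter_dicts(stream):
    print(d)
```

In a real program the stream would be the stdout of the popened process; the top-level [ , ] and commas are simply skipped because they occur at depth zero.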




Pickle each dictionary separately. The shelve module can help you do this.

Writer

import shelve

db = shelve.open(filename)
count = 0
for ...whatever...:
    # build the dictionary d
    db[str(count)] = d   # shelve keys must be strings
    count += 1
db['size'] = count
db.close()

Reader

import shelve

db = shelve.open(filename)
size = db['size']
for i in xrange(size):
    d = db[str(i)]   # shelve keys must be strings
    # process the dictionary d
db.close()


