
I am creating a very large array. Rather than keeping this array in memory, I want to write it out to a file as it is built, in a format I can read back in later.

I would use pickle, but it appears pickle is meant for serializing a complete, already-built structure in one go.

In the following example, I need a way for the out variable to be a file rather than a memory stored object:

out = []
for x in y:
    z = []
    #get lots of data into z
    out.append(z)

3 Answers


Take a look at streaming-pickle.

streaming-pickle allows you to save/load a sequence of Python data structures to/from disk in a streaming (incremental) manner, thus using far less memory than regular pickle.

It's actually just a single file containing three short functions. I've included the snippet below, along with a usage example:

try:
    from cPickle import dumps, loads
except ImportError:
    from pickle import dumps, loads


def s_dump(iterable_to_pickle, file_obj):
    """ dump contents of an iterable iterable_to_pickle to file_obj, a file
    opened in write mode """
    for elt in iterable_to_pickle:
        s_dump_elt(elt, file_obj)

def s_dump_elt(elt_to_pickle, file_obj):
    """ dumps one element to file_obj, a file opened in write mode """
    pickled_elt_str = dumps(elt_to_pickle)
    file_obj.write(pickled_elt_str)
    # record separator is a blank line
    # (since pickled_elt_str might contain its own newlines)
    file_obj.write('\n\n')

def s_load(file_obj):
    """ load contents from file_obj, returning a generator that yields one
        element at a time """
    cur_elt = []
    for line in file_obj:
        cur_elt.append(line)

        if line == '\n':
            pickled_elt_str = ''.join(cur_elt)
            elt = loads(pickled_elt_str)
            cur_elt = []
            yield elt

Here's how you could use it:

from __future__ import print_function
import os
import sys

if __name__ == '__main__':
    if os.path.exists('obj.serialized'):
        # load a file 'obj.serialized' from disk and 
        # spool through iterable      
        with open('obj.serialized', 'r') as handle:
            _generator = s_load(handle)
            for element in _generator:
                print(element)
    else:
        # or create it first, otherwise
        with open('obj.serialized', 'w') as handle:
            for i in xrange(100000):
                s_dump_elt({'i' : i}, handle)
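
As an aside, the snippet above targets Python 2 (text-mode file handles, xrange). If you don't need the blank-line file format itself, a simpler variant I'd sketch for Python 3 is to call pickle.dump repeatedly on a binary file and pickle.load until EOFError, since the pickle format is self-delimiting (this is my own sketch, not part of streaming-pickle):

import pickle

def dump_stream(iterable, path):
    """ write each item as its own pickle record; nothing accumulates in memory """
    with open(path, 'wb') as f:
        for item in iterable:
            pickle.dump(item, f)

def load_stream(path):
    """ generator that yields one unpickled item at a time """
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

dump_stream(({'i': i} for i in range(100000)), 'obj.pickled')
for element in load_stream('obj.pickled'):
    pass  # process one element at a time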


HDF5 maybe? It's got fairly broad support, and lets you append to existing datasets.
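
For example, with h5py (just one of the HDF5 bindings; the file name, dataset name, and row width below are placeholders), you can create a dataset that is resizable along its first axis and append one row at a time:

import h5py
import numpy as np

with h5py.File('big_array.h5', 'w') as f:
    # resizable along the first axis; 100 columns is an arbitrary example width
    dset = f.create_dataset('out', shape=(0, 100), maxshape=(None, 100),
                            dtype='f8', chunks=True)
    for i in range(1000):
        z = np.random.rand(100)              # stand-in for "lots of data"
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1, :] = z

# read it back later without pulling everything into memory
with h5py.File('big_array.h5', 'r') as f:
    for row in f['out']:                     # iterates over the first axis
        pass                                 # process one row at a time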



I could imagine using string pickling, prepending a length indicator to each record:

import os
import struct
import pickle # or cPickle

def loader(inf):
    while True:
        # each record starts with a 4-byte big-endian length prefix
        s = inf.read(4)
        if not s:
            return  # end of file reached
        length, = struct.unpack(">L", s)
        data = inf.read(length)
        yield pickle.loads(data)

if __name__ == '__main__':
    if os.path.exists('dumptest'):
        # load file
        with open('dumptest', 'rb') as inf:
            for element in loader(inf):
                print element
    else:
        # or create it first, otherwise
        with open('dumptest', 'wb') as outf:
            for i in xrange(100000):
                dump = pickle.dumps({'i' : i}, protocol=-1) # or whatever you want as protocol...
                lenstr = struct.pack(">L", len(dump))
                outf.write(lenstr + dump)

This doesn't buffer any more data than necessary, cleanly separates the items from one another, and works with every pickling protocol, including the binary ones.
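
To connect this back to the question's loop, a rough sketch (y and z below are stand-ins for the real data) would write each z to disk as soon as it is built instead of appending it to out, and you'd read the file back later with loader() above:

import pickle
import struct

y = range(1000)                          # stand-in for the real outer sequence

with open('dumptest', 'wb') as outf:
    for x in y:
        z = [x, x * x, x * x * x]        # stand-in for "lots of data"
        dump = pickle.dumps(z, protocol=-1)
        outf.write(struct.pack(">L", len(dump)) + dump)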

