
I am creating a very large array. Rather than keeping this array in memory, I want to write it out to a file as it is built, in a format I can read back in later.

I would use pickle, but it appears pickle is meant for serializing a complete, already-built structure in one go.

In the following example, I need a way for the out variable to be a file rather than a memory stored object:

out = []
for x in y:
    z = []
    #get lots of data into z
    out.append(z)

3 Answers


Take a look at streaming-pickle.

streaming-pickle allows you to save/load a sequence of Python data structures to/from disk in a streaming (incremental) manner, thus using far less memory than regular pickle.

It's actually just a single file containing three short functions. I've included the snippet below, along with a usage example:

try:
    from cPickle import dumps, loads
except ImportError:
    from pickle import dumps, loads


def s_dump(iterable_to_pickle, file_obj):
    """ dump contents of an iterable iterable_to_pickle to file_obj, a file
    opened in write mode """
    for elt in iterable_to_pickle:
        s_dump_elt(elt, file_obj)

def s_dump_elt(elt_to_pickle, file_obj):
    """ dumps one element to file_obj, a file opened in write mode """
    pickled_elt_str = dumps(elt_to_pickle)
    file_obj.write(pickled_elt_str)
    # record separator is a blank line
    # (since pickled_elt_str might contain its own newlines)
    file_obj.write('\n\n')

def s_load(file_obj):
    """ load contents from file_obj, returning a generator that yields one
        element at a time """
    cur_elt = []
    for line in file_obj:
        cur_elt.append(line)

        if line == '\n':
            pickled_elt_str = ''.join(cur_elt)
            elt = loads(pickled_elt_str)
            cur_elt = []
            yield elt

Here's how you could use it:

from __future__ import print_function
import os
import sys

if __name__ == '__main__':
    if os.path.exists('obj.serialized'):
        # load a file 'obj.serialized' from disk and 
        # spool through iterable      
        with open('obj.serialized', 'r') as handle:
            _generator = s_load(handle)
            for element in _generator:
                print(element)
    else:
        # or create it first, otherwise
        with open('obj.serialized', 'w') as handle:
            for i in xrange(100000):
                s_dump_elt({'i' : i}, handle)
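
As an aside, the snippet above targets Python 2 (text-mode file handles, xrange). If you don't need the blank-line file format itself, a simpler variant I'd sketch for Python 3 is to call pickle.dump repeatedly on a binary file and pickle.load until EOFError, since the pickle format is self-delimiting (this is my own sketch, not part of streaming-pickle):

import pickle

def dump_stream(iterable, path):
    """ write each item as its own pickle record; nothing accumulates in memory """
    with open(path, 'wb') as f:
        for item in iterable:
            pickle.dump(item, f)

def load_stream(path):
    """ generator that yields one unpickled item at a time """
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

dump_stream(({'i': i} for i in range(100000)), 'obj.pickled')
for element in load_stream('obj.pickled'):
    pass  # process one element at a time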


HDF5 maybe? It's got fairly broad support, and lets you append to existing datasets.
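
For example, with h5py (just one of the HDF5 bindings; the file name, dataset name, and row width below are placeholders), you can create a dataset that is resizable along its first axis and append one row at a time:

import h5py
import numpy as np

with h5py.File('big_array.h5', 'w') as f:
    # resizable along the first axis; 100 columns is an arbitrary example width
    dset = f.create_dataset('out', shape=(0, 100), maxshape=(None, 100),
                            dtype='f8', chunks=True)
    for i in range(1000):
        z = np.random.rand(100)              # stand-in for "lots of data"
        dset.resize(dset.shape[0] + 1, axis=0)
        dset[-1, :] = z

# read it back later without pulling everything into memory
with h5py.File('big_array.h5', 'r') as f:
    for row in f['out']:                     # iterates over the first axis
        pass                                 # process one row at a time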



I could imagine using string pickling, prepending a length indicator to each record:

import os
import struct
import pickle # or cPickle

def loader(inf):
    while True:
        # each record starts with a 4-byte big-endian length prefix
        s = inf.read(4)
        if not s:
            return  # end of file reached
        length, = struct.unpack(">L", s)
        data = inf.read(length)
        yield pickle.loads(data)

if __name__ == '__main__':
    if os.path.exists('dumptest'):
        # load file
        with open('dumptest', 'rb') as inf:
            for element in loader(inf):
                print element
    else:
        # or create it first, otherwise
        with open('dumptest', 'wb') as outf:
            for i in xrange(100000):
                dump = pickle.dumps({'i' : i}, protocol=-1) # or whatever you want as protocol...
                lenstr = struct.pack(">L", len(dump))
                outf.write(lenstr + dump)

This doesn't buffer any more data than necessary, cleanly separates the items from one another, and works with every pickling protocol, including the binary ones.
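
To connect this back to the question's loop, a rough sketch (y and z below are stand-ins for the real data) would write each z to disk as soon as it is built instead of appending it to out, and you'd read the file back later with loader() above:

import pickle
import struct

y = range(1000)                          # stand-in for the real outer sequence

with open('dumptest', 'wb') as outf:
    for x in y:
        z = [x, x * x, x * x * x]        # stand-in for "lots of data"
        dump = pickle.dumps(z, protocol=-1)
        outf.write(struct.pack(">L", len(dump)) + dump)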

