I'm running a large number of computations whose results I want to save to disk one item at a time, since the whole dataset is too big to hold in memory. I tried using shelve to save it, but I get the error:

HASH: Out of overflow pages.  Increase page size

My code is below. What is the right way to do this in Python? pickle loads objects into memory, and shelve supports on-disk writes but forces a dictionary structure, where you're limited by the number of keys. The final data I am saving is just a list and does not need to be in dictionary form; I just need to be able to read it back one item at a time.

import shelve
def my_data():
  # this is a generator that yields data points
  for n in xrange(very_large_number):
    yield data_point

def save_result():
  db = shelve.open("result")
  n = 0
  for data in my_data():
    # result is a Python object (a tuple)
    result = compute(data)
    # now save result to disk under a unique key
    db[str(n)] = result
    n += 1
  db.close()

2 Answers


It's easy if you use klepto, which lets you transparently store objects in files or databases. First, I show how to work directly with the archive backend (i.e. writing directly to disk).

>>> import klepto
>>> db = klepto.archives.dir_archive('db', serialized=True, cached=False)
>>> db['n'] = 69     
>>> db['add'] = lambda x,y: x+y
>>> db['x'] = 42
>>> db['y'] = 11
>>> db['sub'] = lambda x,y: y-x
>>> 

Then we restart, creating a new connection to the on-disk "database".

Python 2.7.11 (default, Dec  5 2015, 23:50:48) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> db = klepto.archives.dir_archive('db', serialized=True, cached=False)
>>> db     
dir_archive('db', {'y': 11, 'x': 42, 'add': <function <lambda> at 0x10e500d70>, 'sub': <function <lambda> at 0x10e500de8>, 'n': 69}, cached=False)
>>> 

Or you could create a new connection that uses an in-memory proxy. Below, I show only loading the desired entries to memory.

Python 2.7.11 (default, Dec  5 2015, 23:50:48) 
[GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import klepto
>>> db = klepto.archives.dir_archive('db', serialized=True, cached=True)
>>> db
dir_archive('db', {}, cached=True)
>>> db.load('x', 'y')  # read multiple
>>> db.load('add')     # read one at a time
>>> db
dir_archive('db', {'y': 11, 'x': 42, 'add': <function <lambda> at 0x1079e7d70>}, cached=True)
>>> db['result'] = db['add'](db['x'],db['y'])
>>> db['result']
53
>>>

…or one can dump new entries back to disk as well.

>>> db.dump('result')
>>>



The following program demonstrates how you might go about the process you described in your question. It simulates the creation, writing, reading, and processing of the data your application may need to handle. In its default form, the code generates about 32 GB of data and writes it to disk. After a little experimentation, enabling gzip compression provided good speed and reduced the file size to about 195 MB. You should adapt the example to your problem, and may find, by trial and error, that some compression techniques suit your data better than others.

#! /usr/bin/env python3
import os
import pickle


# Uncomment one of these imports to enable file compression:
# from bz2 import open
# from gzip import open
# from lzma import open


DATA_FILE = 'results.dat'
KB = 1 << 10
MB = 1 << 20
GB = 1 << 30
TB = 1 << 40


def main():
    """Demonstrate saving data to and loading data from a file."""
    save_data(develop_data())
    analyze_data(load_data())


def develop_data():
    """Create some sample data that can be saved for later processing."""
    return (os.urandom(1 * KB) * (1 * MB // KB) for _ in range(32 * GB // MB))


def save_data(data):
    """Take in all data and save it for retrieval later on."""
    with open(DATA_FILE, 'wb') as file:
        for obj in data:
            pickle.dump(obj, file, pickle.HIGHEST_PROTOCOL)


def load_data():
    """Load each item that was previously written to disk."""
    with open(DATA_FILE, 'rb') as file:
        try:
            while True:
                yield pickle.load(file)
        except EOFError:
            pass


def analyze_data(data):
    """Pretend to do something useful with each object that was loaded."""
    for obj in data:
        print(hash(obj))


if __name__ == '__main__':
    main()
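To see the compression swap on a small scale, the sketch below writes a stream of pickled tuples through gzip.open and reads them back one item at a time, which is exactly the shape of the asker's problem (the data and file name here are toy values):

```python
import gzip
import pickle

DATA_FILE = 'results_demo.dat'


def save_results(results):
    """Pickle each item into one gzip-compressed file, one dump per item."""
    with gzip.open(DATA_FILE, 'wb') as file:
        for result in results:
            pickle.dump(result, file, pickle.HIGHEST_PROTOCOL)


def load_results():
    """Yield items back lazily; only one is decompressed at a time."""
    with gzip.open(DATA_FILE, 'rb') as file:
        try:
            while True:
                yield pickle.load(file)
        except EOFError:
            pass


save_results((n, n * n) for n in range(4))
for item in load_results():
    print(item)   # each item could instead be processed and discarded
```

Because save_results consumes a generator and load_results is itself a generator, neither side ever materializes the full list in memory.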

