
I am looking for (a Python interface to) an iterable data structure that can hold a large number of items. Ideally, the total memory used by all the items in the list can be larger than the available RAM: objects are transparently swapped in and out of a disk file as they are accessed, and only a small, configurable number of them are loaded in RAM at any given time. In other words, I would like something like C++'s STXXL library, but I only need a list-like container.

Furthermore, the data structure needs to allow: storing arbitrary Python objects, adding/removing elements (either by position or by value), iterating over all elements, in/__contains__ checks, and (possibly) a quick way to select elements satisfying a simple attribute-equality predicate (e.g., x.foo == 'bar').

Here's an example of the API that I would like to see:

   # persist list data to `foo.dat`, keep 100 items in memory
   l = FileBackedList('foo.dat', 100)

   # normal Python list operations work as expected
   l.append('x'); len(l) == 1
   l.extend([1, 2, 3])
   l.remove('x'); len(l) == 3
   l.pop(0);      len(l) == 2

   2 in l  # => True

   # there should be at least one way of doing the following
   k = [item for item in l if item > 2]
   k = filter(lambda item: item > 2, l)

It is acceptable for the implementation to be neither particularly fast nor efficient; the ability to handle large numbers of objects within constrained memory is paramount.

Before I start rolling my own implementation, is there an existing library that I can already plug into my app? Or at least some code to take inspiration from?

4 Answers


I believe you're looking for something like numpy's memmap array. If you want a more fully-featured tabular data structure, the SFrame from GraphLab works, but note that that library is only free for non-commercial use; numpy can be used for anything.
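
For instance, a minimal sketch of how a disk-backed numpy array can be set up with numpy.memmap (the file name, dtype and shape below are just placeholders):

    import numpy as np

    # create a disk-backed array of 10 million float64 values in 'data.dat';
    # only the pages that are actually touched get loaded into RAM
    arr = np.memmap('data.dat', dtype='float64', mode='w+', shape=(10_000_000,))
    arr[:100] = np.arange(100)
    arr.flush()   # push dirty pages back to disk

    # reopen read-only later; data is again paged in lazily
    arr2 = np.memmap('data.dat', dtype='float64', mode='r', shape=(10_000_000,))
    print(arr2[:5])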


1 Comment

Thank you for the suggestions -- they definitely go in the right direction. However, NumPy's memmap works like a NumPy array, i.e., it is a container for homogeneous data that fits a NumPy dtype, so it cannot hold generic Python objects. Similarly, SFrame AFAICT provides a way of processing large CSV files: again, it cannot be used as an iterable container for generic Python objects.

@Adam: The SFrame is open source. This is exactly what you need here (https://github.com/dato-code/SFrame)

1 Comment

This should rather be posted as a comment to Adam's answer.

shelve is in the Python standard library and is about half of what you need. Unfortunately, it only provides a dictionary-like object, so I wrote a list-like interface for dictionary-like objects. Here's what it looks like in action:

import shelve
import tempfile
import os.path

from listlike import ListLike

# a small example class; instances of it are pickled further below
class A:
    def __init__(self, foo, bar):
        self.foo = foo
        self.bar = bar

with tempfile.TemporaryDirectory() as tempdir:
    with shelve.open(os.path.join(tempdir,'foo.dat')) as sf:
        q = ListLike(sf)

        # normal Python list operations work as expected
        q.append('x');
        assert len(q) == 1
        q.extend([1, 2, 3])
        q.remove('x');
        assert len(q) == 3
        q.pop(0);
        assert len(q) == 2
        q.append('Hello!')
        assert q[2] == 'Hello!'

        # technically you can use list() to create an actual list
        assert list(q) == [2, 3, 'Hello!']
        # but if your sf is super large, this obviously is bad.
        # I use it here for illustration purposes.

        # still, you can use all the normal list operations
        # (that I remembered to implement/check)
        assert 2 in q  # => True
        del q[2]
        assert list(q[1:2]) == [3]
        assert list(q[-1:]) == [3]
        # except addition, 'cause we don't want to copy large data
        try:
            q + [10]
        except TypeError:
            pass
        # but, iadd works fine
        q += [10] # same as q.extend([10])


        # normal index range rules
        try:
            q[100]
        except IndexError:
            pass

        q.extend([0, 1, 2, 3, 4, 5])
        # both of the following work as intended
        k1 = [item for item in q if item > 2]
        k2 = list(filter(lambda item: item > 2, q))
        assert list(q) == [2, 3, 10, 0, 1, 2, 3, 4, 5]
        assert k1 == [3, 10, 3, 4, 5]
        assert k2 == [3, 10, 3, 4, 5]
        assert k1 == k2

        # the values get pickled, so they can be any arbitrary object
        q.append(A(1,2))
        q.append(A('Hello',' there'))
        q.append(A('Hello',2))

        # reminder: pickling pickles the whole class, so if you are
        # storing instances of your own custom class, then updates to
        # that class' code won't be reflected in the persisted instances.
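
For a rough idea of how such a list-like wrapper over a dict-like store can be put together (this is only an illustrative sketch, not the actual listlike module used above), indices can be mapped to string keys and collections.abc.MutableSequence can supply append/extend/pop/remove/in for free:

    from collections.abc import MutableSequence

    class ListLike(MutableSequence):
        """Sketch: a list-like view over a dict-like store such as shelve.Shelf."""

        def __init__(self, store):
            self._store = store           # any dict-like object
            self._length = len(store)     # assumes keys are '0' .. 'n-1'

        def __len__(self):
            return self._length

        def _index(self, index):
            # normalise negative indices and range-check, list-style
            if index < 0:
                index += self._length
            if not 0 <= index < self._length:
                raise IndexError(index)
            return index

        def __getitem__(self, index):
            if isinstance(index, slice):
                return [self[i] for i in range(*index.indices(self._length))]
            return self._store[str(self._index(index))]

        def __setitem__(self, index, value):
            self._store[str(self._index(index))] = value

        def __delitem__(self, index):
            index = self._index(index)
            # shifting every later element down one slot is the slow path
            # mentioned in the caveats below
            for i in range(index, self._length - 1):
                self._store[str(i)] = self._store[str(i + 1)]
            del self._store[str(self._length - 1)]
            self._length -= 1

        def insert(self, index, value):
            if index < 0:
                index += self._length
            index = max(0, min(index, self._length))
            # shift elements up to make room for the new value
            for i in range(self._length, index, -1):
                self._store[str(i)] = self._store[str(i - 1)]
            self._store[str(index)] = value
            self._length += 1

In a sketch like this, the slow pop/insert/remove/del behaviour noted in the caveats below falls straight out of the element-shifting loops in __delitem__ and insert.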

Caveats

  1. Any operation that requires changing the indices of the elements (pop, insert, remove, del, etc.) is quite slow in this implementation (it pickles/unpickles everything to change indices). For the basic .append()/.extend() and get operations it is the same speed as indexing shelve.
  2. shelve uses dbm underneath, and there are different dbm implementations; a quick way to check which backend is in use is sketched after this list.
  3. I think that if your number of elements is so large that the list of keys() doesn't fit in memory, then shelve will run out of memory when trying to load the file.
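
Regarding caveat 2, a quick way to check which dbm backend wrote an existing shelf file (the file name is just the one from the example above):

    import dbm

    # report which dbm backend wrote an existing shelf file;
    # returns e.g. 'dbm.gnu', 'dbm.ndbm' or 'dbm.dumb', '' if the format is
    # unrecognised, or None if the file cannot be read
    print(dbm.whichdb('foo.dat'))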

Caching

Using a shelve.Shelf object, I found that repeatedly reading 1000 arbitrary locations was no slower in a 1.5 GB file with 13 million entries than in a 15 MB file with 100 thousand entries.

I suspect that this is because operating systems/file systems do a lot of internal file caching to speed up reads to common sectors. Or perhaps gdbm already does some caching?

Anyway, if you really need it, you can extend either shelve.Shelf or my ListLike and use functools.lru_cache on the get function.
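
If you do go that route, a sketch of what that could look like, assuming a ListLike-style wrapper such as the one sketched above (note that a read cache like this will serve stale values if entries are modified or shifted, so it mainly suits read-mostly workloads):

    from functools import lru_cache

    class CachedListLike(ListLike):
        """Sketch: ListLike with an LRU read cache in front of the disk store."""

        def __init__(self, store, cache_size=100):
            super().__init__(store)
            # keep the 'cache_size' most recently read values unpickled in RAM
            self._cached_get = lru_cache(maxsize=cache_size)(self._store.__getitem__)

        def __getitem__(self, index):
            if isinstance(index, slice):
                return [self[i] for i in range(*index.indices(len(self)))]
            return self._cached_get(str(self._index(index)))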

2 Comments

Many thanks for your answer! I would split the two options into separate answers, as I think they have very different characteristics: for one, the swap file solution requires admin rights.
Oh, I didn't realise that I could! I shall do so. Also, I'm kinda surprised you still care about this question at all, 5 years later. I just wanted something to do this also, and couldn't find it elsewhere.

You could consider simply creating a large enough swap file to hold all of your data in your application at once.

Swap files (or page files) reserve part of your disk space to be used when your applications use up too much RAM. This lets the operating system itself handle caching in RAM and moving things to disk, without anything special within Python.

Note, though, that this requires admin rights to set up, works differently on different operating systems, and requires allocating enough disk space to hold your entire dataset at once.

