I am looking for (Python interface to) an iterable data structure that can hold a large quantity of items. Ideally, the memory used by all the items in the list is larger than the available RAM: objects are transparently swapped in and out some disk file as they are accessed; only a small configurable number of them are loaded in RAM at any given time. In other words, I would like to see something like C++'s STXXL library, but I only need a list-like container.
Furthermore, the data structure needs to support: storing arbitrary Python objects, adding/removing
elements (either by position or by value), iterating over all
elements, in/__contains__ checks, and (possibly) a quick way to
select elements satisfying a simple attribute-equality predicate
(e.g., x.foo == 'bar').
Here's an example of the API that I would like to see:
# persist list data to `foo.dat`, keep 100 items in memory
l = FileBackedList('foo.dat', 100)
# normal Python list operations work as expected
l.append('x'); len(l) == 1
l.extend([1, 2, 3])
l.remove('x'); len(l) == 3
l.pop(0); len(l) == 2
2 in l # => True
# there should be at least one way of doing the following
k = [item for item in l if item > 2]
k = list(filter(lambda item: item > 2, l))
It is acceptable for the implementation not to be particularly fast or efficient; the ability to handle a large number of objects under constrained memory is what matters most.
Before I start rolling my own implementation, is there any existing library that I can already plug into my app? Or at least some code to take inspiration from?