0

I create a long list of python objects, with many identical entries. In order to save some memory, I want to ensure that identical elements share their memory. For example:

In [1]: alist = [(3, 4), (3, 4)]

In [2]: alist[0] == alist[1]
Out[2]: True

In [3]: alist[0] is alist[1]
Out[3]: False

The two list elements are equal, but use separate memory. In this simple example the problem can be fixed by

In [4]: alist[1] = alist[0]

In [5]: alist[1] is alist[0]
Out[5]: True

How can this be done in a more general way? The list is not sorted, so identical elements are usually not next to each other. One solution I came up with is this:

g_dupdict = dict()
def dedup(x):
  try:
    return g_dupdict[x]
  except KeyError:
    g_dupdict[x] = x
    return x

for k in range(len(alist)):
  alist[k] = dedup(alist[k])

This works, but introduces new problems. It seems silly to use a dict, a set should do, but I don't know how to make this work. And then the dict holds an additional reference to each object, so the memory doesn't get freed when the elements are deleted from the list. To fix this, I delete the dict occasionally, but as a result it must be re-created whenever new elements get added to the list. Is there is better solution to this problem? Thanks.

2
  • Do you really need this? Yes, it can save some memory, but it may actually use more memory (depending on how big your real data items are, and how common dupes are) and it will make your code slower & more complex (and thus creating more places for bugs to lurk). But anyway, I can't think of a better way to do this, although you can make the dedup function more efficient by using g_dupdict.setdefault(x, x). Commented Aug 12, 2017 at 12:16
  • I have to process a huge data set, without this only 2% of it fits into memory, and with this optimization I can work with 15% of it. Commented Aug 14, 2017 at 8:06

1 Answer 1

1

I imagine you are manipulating huge data structures in order to be concern about that. One way to keep g_dedupdict scoped to a given data structure is to encapsulate it inside your own list implementation:

class CachedList(list):
    def __init__(self, iterable=(), cache=None):
        self.__cache = cache or {}
        super().__init__([self.__cached(item) for item in iterable])

    def append(self, e):
        super().append(self.__cached(e))

    def clear(self):
        self.__cache.clear()
        super().clear()

    def copy(self):
        return CachedList(self, self.__cache)

    def extend(self, iterable):
        super().extend([self.__cached(item) for item in iterable])

    def insert(self, index, e):
        super().insert(index, self.__cached(e))

    def remove(self, e):
        super().remove(self.__cached(e))
        del self.__cache[hash(e)]

    def __cached(self, e):
        if hash(e) not in self.__cache:
            self.__cache[hash(e)] = e
        return self.__cache[hash(e)]

    def __add__(self, iterable):
        cached_iterable = [self.__cached(item) for item in iterable]
        return CachedList(super().__add__(cached_iterable), self.__cache)

    def __delitem__(self, e):
        super().__delitem__(self.__cached(e))
        del self.__cache[hash(e)]

    def __iadd__(self, e):
        self.append(self.__cached(e))
        return self

    def __setitem__(self, index, e):
        super().__setitem__(index, self.__cached(e))


if __name__ == '__main__':
    e1 = (3, 4)
    e2 = (3, 4)
    assert e1 == e2
    assert e1 is not e2
    assert hash(e1) == hash(e2)

    alist = CachedList([(3, 4), (3, 4)])
    assert alist[0] == alist[1]
    assert alist[0] is alist[1]

    alist[0] = (3, 4)
    assert alist[0] == alist[1]
    assert alist[0] is alist[1]

    alist += (3, 4)
    assert alist[0] == alist[-1]
    assert alist[0] is alist[-1]

    newlist = alist + [(3, 4)]
    assert newlist[0] == newlist[-1]
    assert newlist[0] is newlist[-1]

The implementation may not be complete, but I think can get the idea from the code snippet above.

Sign up to request clarification or add additional context in comments.

1 Comment

FWIW, it's generally simpler (and just as effective) to subclass collections.UserList than to subclass list, although I guess there may be some performance benefits in subclassing list itself.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.