I am using a regular Python 3 dictionary as a hashmap where both the keys and values are positive integers. The following code shows that a dict with roughly 6.3 million keys requires 320 MB of memory (as reported by sys.getsizeof).
import numpy as np
from sys import getsizeof
N = 10*1000*1000
a = np.random.randint(0, N, N)  # random keys (contain duplicates)
b = np.random.randint(0, N, N)  # random values
d = dict(zip(a, b))             # duplicate keys collapse, so len(d) < N
print('Number of elements:', len(d), 'Memory size (MB):', round(getsizeof(d)/2**20, 3))
print('Element memory size (B):', getsizeof(next(iter(d.values()))))
# Number of elements: 6323010 Memory size (MB): 320.0
# Element memory size (B): 32
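Note that getsizeof(d) counts only the dict's internal table, not the np.int64 key and value objects it references (32 bytes each, per the output above), so the real footprint is higher. A rough way to include those per-entry objects:

entry_bytes = sum(getsizeof(k) + getsizeof(v) for k, v in d.items())
print('Table (MB):', round(getsizeof(d)/2**20, 1),
      'Entry objects (MB):', round(entry_bytes/2**20, 1))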
How can I create a more memory-efficient hashmap, ideally with O(1) lookup? The hashmap can be immutable; it only needs to be read after it is built.
In my use case, the hashmap can grow to around 2 billion entries, and a Python dict of that size would need an estimated 64 GB of memory. Even if that still fits in RAM, other processes on the machine also need memory, so I would like something considerably smaller.
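For context, here is a minimal sketch of one direction I have considered (not what I am asking for, since lookups are O(log n) rather than O(1)): because the map can be immutable, the keys and values could live in two parallel int64 NumPy arrays sorted by key, at roughly 16 bytes per entry, with lookups via binary search. The IntIntMap class below is only an illustration and assumes the keys passed in are unique (the dict above already de-duplicates the random keys).

import numpy as np

class IntIntMap:
    """Immutable int -> int map: two parallel int64 arrays sorted by key.

    Memory is about 16 bytes per entry; lookups use binary search
    (np.searchsorted), so they are O(log n), not O(1).
    """
    def __init__(self, keys, values):
        keys = np.asarray(keys, dtype=np.int64)
        values = np.asarray(values, dtype=np.int64)
        order = np.argsort(keys)      # sort once at build time
        self._keys = keys[order]
        self._values = values[order]

    def __getitem__(self, key):
        i = np.searchsorted(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return int(self._values[i])
        raise KeyError(key)

# Build from the dict above (keys are unique at this point):
keys = np.fromiter(d.keys(), dtype=np.int64, count=len(d))
vals = np.fromiter(d.values(), dtype=np.int64, count=len(d))
m = IntIntMap(keys, vals)
assert m[keys[0]] == int(d[keys[0]])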