using strings as python dictionaries (memory management)

Question

I need to find identical sequences of characters in a collection of texts. Think of it as finding identical/plagiarized sentences. The naive way is something like this:

from collections import defaultdict

ht = defaultdict(int)
for s in sentences:
    ht[s] += 1

I usually use Python, but I'm beginning to think that Python is not the best choice for this task. Am I wrong about that? Is there a reasonable way to do it with Python?

If I understand correctly, Python dictionaries use open addressing, which means that the key itself is also saved in the array. If so, a Python dictionary allows efficient lookup but is very bad in memory usage: with millions of sentences, every sentence is kept in the dictionary, which exceeds the available memory and makes the dictionary an impractical solution.

Can someone confirm the previous paragraph?

One solution that comes to mind is explicitly using a hash function (the built-in hash function, a custom one, or the hashlib module) and, instead of doing ht[s]+=1, doing: ht[hash(s)]+=1

This way the key stored in the array is an int (that will be hashed again) instead of the full sentence.

Will that work? Should I expect collisions? Are there any other Pythonic solutions?

Thanks!

Hmm, for some reason Firefox wouldn't let me cast votes or add comments. That's why you have more than one browser installed (Chrome works).
– ScienceFriction, Commented Jul 20, 2011 at 5:57

Wai Yip Tung · Accepted Answer · 2011-07-19 20:56:19Z

2

Yes, a dict stores the keys in memory. If your data fits in memory, this is the easiest approach.
Hashing should work. Try MD5: its digest is 16 bytes (128 bits), so collisions are unlikely.
Try BerkeleyDB for a disk-based approach.
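
As a rough illustration of the MD5 suggestion, here is a minimal sketch written for Python 3. The helper names `digest_key` and `count_sentences` and the sample sentences are made up for this example; only `hashlib.md5` and `collections.defaultdict` come from the standard library. It stores a 128-bit integer per distinct sentence instead of the sentence text.

```python
import hashlib
from collections import defaultdict


def digest_key(sentence):
    """Return the 16-byte MD5 digest of a sentence as a 128-bit integer."""
    digest = hashlib.md5(sentence.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def count_sentences(sentences):
    """Count occurrences per digest, so the dict never holds the text itself."""
    counts = defaultdict(int)
    for s in sentences:
        counts[digest_key(s)] += 1
    return counts


# Hypothetical sample data, just to exercise the functions.
sentences = ["a b c", "d e f", "a b c"]
counts = count_sentences(sentences)

# A second pass over the texts recovers which sentences were repeated.
repeated = [s for s in set(sentences) if counts[digest_key(s)] > 1]
```

The dictionary itself no longer tells you which sentence a digest belongs to, so repeated sentences have to be recovered with a second pass over the input, as in the last line.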


5 Comments

GaretJax Over a year ago

Why BerkeleyDB? Couldn't redis be a better option?

Michael Dillon Over a year ago

BerkeleyDB uses B-tree indexes, so it handles sorted sequences of keys very efficiently. It also handles guaranteed persistence. Redis is not a database, it is a network protocol service. Sending the data to another server is not the same as writing it to a disk. Of course Redis can be configured to write to a disk as well, but if that is your goal, why bother with Redis in the middle?

ScienceFriction Over a year ago

How does BerkeleyDB compare to SQLite?

Wai Yip Tung Over a year ago

I'm not making any in-depth DB comparison here. I suggest BerkeleyDB because it works like a disk-based persistent dictionary and it comes for free with Python, at least up to 2.x. It does the job with minimal effort, that's it.

ScienceFriction Over a year ago

SQLite comes free with Python as well. It runs as part of the Python process, so no communication with a server (even a local one) is needed. Unlike BDB, it allows relational operations. I'll be experimenting with one of them, so I'm sniffing around.
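
For anyone weighing the disk-based options in this thread, the SQLite variant might look roughly like the sketch below (Python 3, standard-library `sqlite3` and `hashlib` only). The table name `counts`, its columns, and the function `count_on_disk` are assumptions made up for this example. It uses `":memory:"` so it runs anywhere; passing a file path instead keeps the counts on disk.

```python
import hashlib
import sqlite3


def count_on_disk(sentences, path=":memory:"):
    """Count sentences by MD5 digest in an SQLite table and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "digest BLOB PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for s in sentences:
        digest = hashlib.md5(s.encode("utf-8")).digest()
        # Create the row on first sight, then increment it in either case.
        conn.execute(
            "INSERT OR IGNORE INTO counts (digest, n) VALUES (?, 0)", (digest,)
        )
        conn.execute("UPDATE counts SET n = n + 1 WHERE digest = ?", (digest,))
    conn.commit()
    return conn


# Hypothetical sample data.
conn = count_on_disk(["a b c", "d e f", "a b c"])
duplicates = conn.execute("SELECT COUNT(*) FROM counts WHERE n > 1").fetchone()[0]
distinct = conn.execute("SELECT COUNT(*) FROM counts").fetchone()[0]
```

Running one statement pair per sentence is slow for millions of rows; batching inside a single transaction, as done here with one `commit()` at the end, is the usual first step.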

tomasz · Accepted Answer · 2011-07-19 22:40:24Z

Python dicts are indeed monsters in memory. You can hardly work with millions of keys when storing anything larger than integers. Consider the following code:

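import random
d, BITS = {}, 64  # BITS is the value size being measured
for x in xrange(5000000):  # 5 million keys (Python 2; use range() on Python 3)
  d[x] = random.getrandbits(BITS)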

With BITS = 64 it takes 510MB of my RAM, with BITS = 128 550MB, with BITS = 256 650MB, and with BITS = 512 830MB. Increasing the number of iterations to 10 million doubles the memory usage. However, consider this snippet:

for x in xrange(5000000):  # 5 million keys
  d[x] = (random.getrandbits(64), random.getrandbits(64))

It takes 1.1GB of my memory. Conclusion? If you want to keep two 64-bit integers, pack them into one 128-bit integer, like this:

for x in xrange(5000000):  # still 5 million keys
  d[x] = random.getrandbits(64) | (random.getrandbits(64) << 64)

That cuts memory usage in half.
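
Packing is only useful if the two values can be recovered later. A small sketch of the round trip, written for Python 3; the names `pack`, `unpack`, and `MASK64` are made up for this example:

```python
import random

MASK64 = (1 << 64) - 1  # selects the low 64 bits


def pack(low, high):
    """Combine two 64-bit integers into one 128-bit integer."""
    return low | (high << 64)


def unpack(packed):
    """Split a 128-bit integer back into its (low, high) 64-bit halves."""
    return packed & MASK64, packed >> 64


a = random.getrandbits(64)
b = random.getrandbits(64)
packed = pack(a, b)
restored = unpack(packed)  # equals (a, b)
```

This assumes both inputs are non-negative and fit in 64 bits, which holds for `random.getrandbits(64)`.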

It depends on your actual memory limit and number of sentences, but you should be safe using dictionaries with 10-20 million keys when storing just integers. Hashing is a good idea, but you probably want to keep a pointer to the sentence, so that in case of a collision you can investigate (compare the sentences character by character and probably print them out). You could build the pointer as an integer, for example by combining the file number and the offset within the file. If you don't expect a massive number of collisions, you can simply set up another dictionary that stores only the collisions, for example:

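hashes = {}
collisions = {}
for s in sentences:
  ptr_value = pointer(s)  # integer pointer to the sentence; pointer() is yours to define
  hash_value = hash(s)    # integer hash of the sentence

  if hash_value in hashes:
    # hash seen before: file this pointer under the first sentence's pointer
    collisions.setdefault(hashes[hash_value], []).append(ptr_value)
  else:
    hashes[hash_value] = ptr_value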

At the end you will have a collisions dictionary where each key is a pointer to a sentence and each value is a list of pointers to the sentences it collides with. It sounds pretty hacky, but working with integers is just fine (and fun!).
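
To make the pointer idea concrete, here is one possible sketch in Python 3. Everything named here is an assumption for illustration: `OFFSET_BITS`, `make_pointer`, `split_pointer`, `fetch`, and `find_repeats` are hypothetical, and the "offset" is simply the index of a sentence in an in-memory list standing in for a byte offset into a file. The last step compares the actual text, so a genuine hash collision between different sentences is filtered out.

```python
OFFSET_BITS = 40  # assumption: every offset fits in 40 bits


def make_pointer(file_no, offset):
    """Encode (file number, offset) as a single integer."""
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in OFFSET_BITS")
    return (file_no << OFFSET_BITS) | offset


def split_pointer(ptr):
    """Decode an integer pointer back into (file number, offset)."""
    return ptr >> OFFSET_BITS, ptr & ((1 << OFFSET_BITS) - 1)


def fetch(files, ptr):
    """Look up the sentence a pointer refers to."""
    file_no, offset = split_pointer(ptr)
    return files[file_no][offset]


def find_repeats(files):
    """Map the pointer of each first occurrence to pointers of its true repeats."""
    hashes = {}
    collisions = {}
    for file_no, sentences in enumerate(files):
        for offset, s in enumerate(sentences):
            ptr = make_pointer(file_no, offset)
            h = hash(s)
            if h in hashes:
                collisions.setdefault(hashes[h], []).append(ptr)
            else:
                hashes[h] = ptr
    # Verify candidates by comparing text, dropping accidental hash collisions.
    confirmed = {}
    for first, others in collisions.items():
        same = [p for p in others if fetch(files, p) == fetch(files, first)]
        if same:
            confirmed[first] = same
    return confirmed


# Hypothetical input: two "files", each a list of sentences.
files = [["x", "y"], ["y", "z", "y"]]
repeats = find_repeats(files)
```

With real files, `fetch` would seek to the byte offset and read the sentence from disk, so only integers stay in memory during the scan.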

fransua · Accepted Answer · 2011-07-19 20:57:52Z

0

Perhaps pass the keys through md5: http://docs.python.org/library/md5.html


sampwing · Accepted Answer · 2011-07-19 21:17:31Z

0

I'm not sure exactly how large the data set you are comparing against is, but I would recommend looking into Bloom filters (be careful of false positives): http://en.wikipedia.org/wiki/Bloom_filter ... Another avenue to consider would be something simple like cosine similarity or edit distance between documents. But if you are trying to compare one document with many, I would suggest Bloom filters; you can encode the items however you find most efficient for your problem.
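
For reference, a very small Bloom filter could be sketched as below (Python 3). The class, its default sizes, and the trick of deriving several bit positions from one MD5 digest are choices made for this example, not a tuned implementation. A membership test can return a false positive but never a false negative, so any hit should be treated as a candidate to verify against the real text.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions from two 64-bit halves of one MD5 digest.
        digest = hashlib.md5(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


# Hypothetical usage: flag sentences that were probably seen before.
seen = BloomFilter()
candidates = []
for s in ["a b c", "d e f", "a b c"]:
    if s in seen:
        candidates.append(s)  # possible repeat; confirm against the real text
    seen.add(s)
```

Memory use is fixed by `num_bits` regardless of sentence length, which is the main attraction here; the false-positive rate rises as more items are added.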
