
I am working on deduplicating a large collection of strings in Python and am tackling the problem with sets.Set(). The input is a set of strings read from a text file, and the output is the same strings with duplicates removed.

The script needs to be able to run on a machine with limited main memory (around 2 GB), and the problem is that the set grows too big; my input is an 800 MB text file.

Part of my code:

StringSet = set()
for String in InputFile:
    StringSet.add(String)
return StringSet

Is there a more efficient way around this problem? I've considered a Bloom filter and a trie, but I'd prefer the O(1) lookups of a set.

Edit: I've switched from sets.Set() to the built-in set(), which is supposed to be more memory efficient, but it's still not efficient enough.

  • How do you know it's running out of memory because of the set size - what actually happens? Is that exactly your code? File iteration is a generator already so there's not much left you can change; are you keeping 'String' around too long and it isn't being collected? Are you open to trying third party extensions e.g. pypi.python.org/pypi/blist ? (not sure if relevant, isn't O(1), but is an example). Commented Dec 5, 2014 at 6:26
  • The above is a greatly simplified version of my code. I know I'm running out of memory because a MemoryError is thrown for StringSet on a machine with less than 4GB of main memory but things work fine on my 16GB desktop. Commented Dec 5, 2014 at 6:31
  • What is wrong with set? Why are you going for sets.Set (which is deprecated btw)? Commented Dec 5, 2014 at 6:31
  • @Smac89 Ahhh... didn't notice that sets.Set() is deprecated. Just read about it in the Python documentation and apparently set() has a more memory-efficient pickle. Shall try it out and see if it solves my problem. Thanks Commented Dec 5, 2014 at 6:36
  • @gamerx have you seen pythonhosted.org/Pympler for investigating memory use of Python objects? I haven't used it, but maybe it can help you track down which things are using the most memory and see if anything surprising is also taking up space. Commented Dec 5, 2014 at 6:40

1 Answer


Rather than storing the strings themselves to detect duplicates, you could use a two-file mechanism: for each line read, calculate a hash; if the hash is not already in the hash set, write the line to the output file and add the hash to the set. You only need to keep the table of hashes and a single line in memory at any one time, so this is more memory efficient unless your lines are shorter than the hash length. You should also find that it is a lot faster, as comparing even 64-byte hash entries is much quicker than comparing all the characters of two lines.

Something like:

import hashlib

hasher = hashlib.sha1  # You could use md5, sha224, sha256, sha384 or sha512
hashset = set()  # You could also use a list here - speed v. size would be interesting
with open(input_name, 'rb') as infile, open(output_name, 'wb') as outfile:
    for line in infile:  # Iterate lazily; readlines() would load the whole file into memory
        h = hasher(line).digest()  # Fixed-size digest of the line
        if h not in hashset:
            outfile.write(line)
            hashset.add(h)  # If using a list, append(h)

The only question is balancing the hash type and length between size and the chance of duplication or collision. For reference, the digest sizes in bytes and the claimed chances of the same value for different inputs are:

Hash    Bytes   Collision claim
md5     16      1:2^64
sha1    20      1:2^80
sha224  28      1:2^112
sha256  32      1:2^128
sha384  48      1:2^192
sha512  64      1:2^256
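The collision claims above are the usual birthday bound: for an n-bit digest you need on the order of 2^(n/2) distinct inputs before a collision becomes likely. As a rough sketch of what that means for a file of this size (the line count of 10 million is an assumption, roughly an 800 MB file at ~80 bytes per line), assuming the hash behaves like a random function:

```python
import math

def collision_probability(num_lines, digest_bits):
    """Approximate birthday-bound probability that at least two of
    num_lines distinct inputs share the same digest value."""
    # P(collision) ~= 1 - exp(-k*(k-1) / 2^(n+1)) for k inputs, n-bit digest.
    # expm1 keeps precision when the exponent is tiny.
    exponent = -(num_lines * (num_lines - 1)) / (2.0 ** (digest_bits + 1))
    return -math.expm1(exponent)

# Rough estimate: 10 million distinct lines.
for name, bits in [("md5", 128), ("sha1", 160), ("sha256", 256)]:
    print(name, collision_probability(10_000_000, bits))
```

Even for md5 the probability is vanishingly small at this scale; the practical concern is only a deliberate (adversarial) collision, which md5 and sha1 no longer resist.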

Comments

You must be careful about hash collisions! Different strings may map to the same hash value, especially if you're using small hashes like md5.
@Rob - that is why I had the point about balancing the hash length v chance of duplication. I will add to this.
This does not work. Internally python uses hashtables to implement sets, and therefore the size of items in the set doesn't matter (your hash values will be hashed again by python before being stored in a set: A set is basically just a dict with dummy values)
@tomas - the point is that each line can potentially be 100s or thousands of bytes while your stored hash will be a fixed length for each entry - the original line can be discarded once you have calculated the hash whereas if you are adding the lines to the set they will still all be in RAM.
@SteveBarnes You are right of course, sorry. What I was thinking was that python will hash those lines anyway when constructing the set (so it's either a hash of the lines, or a hash of the hash of the lines). But obviously this only accounts for the length of the hashtable. The real values are of course saved there as well, otherwise we wouldn't be able to retrieve them :p. stackoverflow will not let me remove the downvote unless the answer is edited, which is embarrassing for me. Maybe you can make a small change so I can remove it? apologies.
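To put rough numbers on the trade-off discussed in these comments, here is a small sketch comparing a set of full lines against a set of fixed-size digests. The sample lines are synthetic, and sys.getsizeof only counts each object's own footprint (not the hash table's bucket array), but the ratio is illustrative:

```python
import hashlib
import sys

# Synthetic data: 1000 distinct lines of ~200 bytes each.
lines = [("some fairly long log line number %d with padding " % i).encode() * 4
         for i in range(1000)]

line_set = set(lines)                                      # stores full lines
digest_set = {hashlib.sha1(line).digest() for line in lines}  # stores 20-byte digests

line_bytes = sum(sys.getsizeof(s) for s in line_set)
digest_bytes = sum(sys.getsizeof(d) for d in digest_set)
print("full lines: %d bytes, digests: %d bytes" % (line_bytes, digest_bytes))
```

The longer the lines, the bigger the win; for lines shorter than the digest plus object overhead, storing the lines directly would be cheaper.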
