
I am working on deduplicating a large collection of strings in Python and am tackling the problem with sets.Set(). The input is a set of strings read from a text file, and the output is the same strings with duplicates removed.

The script needs to be able to run on a machine with limited main memory (around 2 GB), and the problem is that the set grows too big; my input is an 800 MB text file.

Part of my code:

StringSet = set()
for String in InputFile:
    StringSet.add(String)
return StringSet

Is there a more efficient way around this problem? I've considered a Bloom filter and a trie, but I'd prefer the O(1) lookups of a set.

Edit: I've switched from sets.Set() to the built-in set(), which is supposed to be more memory efficient, but it's still not efficient enough.

  • How do you know it's running out of memory because of the set size - what actually happens? Is that exactly your code? File iteration is a generator already so there's not much left you can change; are you keeping 'String' around too long and it isn't being collected? Are you open to trying third party extensions e.g. pypi.python.org/pypi/blist ? (not sure if relevant, isn't O(1), but is an example). Commented Dec 5, 2014 at 6:26
  • The above is a greatly simplified version of my code. I know I'm running out of memory because a MemoryError is thrown for StringSet on a machine with less than 4GB of main memory but things work fine on my 16GB desktop. Commented Dec 5, 2014 at 6:31
  • What is wrong with set? Why are you going for sets.Set (which is deprecated btw)? Commented Dec 5, 2014 at 6:31
  • @Smac89 Ahhh... didn't notice that sets.Set() is deprecated. Just read about it in the Python documentation and apparently set() has a more memory-efficient pickle. Shall try it out and see if it solves my problem. Thanks Commented Dec 5, 2014 at 6:36
  • @gamerx have you seen pythonhosted.org/Pympler for investigating memory use of Python objects? I haven't used it, but maybe it can help you track down which things are using the most memory and see if anything surprising is also taking up space. Commented Dec 5, 2014 at 6:40

1 Answer


Rather than storing the strings themselves to detect duplicates, you could use a two-file mechanism: for each line read, calculate a hash; if the hash is not already in the hash set, write the line to the output file and add the hash to the set. You only need to keep the table of hashes and a single line in memory at any one time, so this is more memory efficient unless your lines are shorter than the hash length. You should also find that it is a lot faster, as comparing even 64-byte hash entries is much quicker than comparing all the characters of two lines.

Something like:

import hashlib

hasher = hashlib.sha1  # You could use md5, sha224, sha256, sha384 or sha512
hashset = set()  # You could also use a list here - speed v. size would be interesting
with open(input_name, 'rb') as infile, open(output_name, 'wb') as outfile:
    for line in infile:  # Iterate lazily; readlines() would load the whole file into memory
        h = hasher(line).digest()  # Fixed-size digest of the line
        if h not in hashset:
            outfile.write(line)
            hashset.add(h)  # If using a list, append(h)

The only question is balancing the hash type and length between size and the chance of duplication or collision. For reference, the digest sizes in bytes and the claimed chances of the same value for different inputs are:

Hash    Bytes   Collision claim
md5     16      1:2^64
sha1    20      1:2^80
sha224  28      1:2^112
sha256  32      1:2^128
sha384  48      1:2^192
sha512  64      1:2^256
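The collision claims above are the usual birthday bound: for an n-bit digest you need on the order of 2^(n/2) distinct inputs before a collision becomes likely. As a rough sketch of what that means for a file of this size (the line count of 10 million is an assumption, roughly an 800 MB file at ~80 bytes per line), assuming the hash behaves like a random function:

```python
import math

def collision_probability(num_lines, digest_bits):
    """Approximate birthday-bound probability that at least two of
    num_lines distinct inputs share the same digest value."""
    # P(collision) ~= 1 - exp(-k*(k-1) / 2^(n+1)) for k inputs, n-bit digest.
    # expm1 keeps precision when the exponent is tiny.
    exponent = -(num_lines * (num_lines - 1)) / (2.0 ** (digest_bits + 1))
    return -math.expm1(exponent)

# Rough estimate: 10 million distinct lines.
for name, bits in [("md5", 128), ("sha1", 160), ("sha256", 256)]:
    print(name, collision_probability(10_000_000, bits))
```

Even for md5 the probability is vanishingly small at this scale; the practical concern is only a deliberate (adversarial) collision, which md5 and sha1 no longer resist.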

Comments

You must be careful about hash collisions! Different strings may map to the same hash value, especially if you're using small hashes like md5.
@Rob - that is why I had the point about balancing the hash length v chance of duplication. I will add to this.
This does not work. Internally python uses hashtables to implement sets, and therefore the size of items in the set doesn't matter (your hash values will be hashed again by python before being stored in a set: A set is basically just a dict with dummy values)
@tomas - the point is that each line can potentially be 100s or thousands of bytes while your stored hash will be a fixed length for each entry - the original line can be discarded once you have calculated the hash whereas if you are adding the lines to the set they will still all be in RAM.
@SteveBarnes You are right of course, sorry. What I was thinking was that python will hash those lines anyway when constructing the set (so it's either a hash of the lines, or a hash of the hash of the lines). But obviously this only accounts for the length of the hashtable. The real values are of course saved there as well, otherwise we wouldn't be able to retrieve them :p. stackoverflow will not let me remove the downvote unless the answer is edited, which is embarrassing for me. Maybe you can make a small change so I can remove it? apologies.
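To put rough numbers on the trade-off discussed in these comments, here is a small sketch comparing a set of full lines against a set of fixed-size digests. The sample lines are synthetic, and sys.getsizeof only counts each object's own footprint (not the hash table's bucket array), but the ratio is illustrative:

```python
import hashlib
import sys

# Synthetic data: 1000 distinct lines of ~200 bytes each.
lines = [("some fairly long log line number %d with padding " % i).encode() * 4
         for i in range(1000)]

line_set = set(lines)                                      # stores full lines
digest_set = {hashlib.sha1(line).digest() for line in lines}  # stores 20-byte digests

line_bytes = sum(sys.getsizeof(s) for s in line_set)
digest_bytes = sum(sys.getsizeof(d) for d in digest_set)
print("full lines: %d bytes, digests: %d bytes" % (line_bytes, digest_bytes))
```

The longer the lines, the bigger the win; for lines shorter than the digest plus object overhead, storing the lines directly would be cheaper.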
