
I'm trying to compare two lists of MD5 hashes and identify matches. One of these lists contains approximately 34,000,000 hashes, and the other could contain up to 1,000,000.

Using NumPy I've been experimenting with the time it takes to conduct the comparison versus a standard Python list, and the performance difference is very impressive. The experimental hash datasets I've been using only contain 1,000,000 entries each, but when I try to simulate the target dataset of 34,000,000, the Python script returns the following error:

process started - [2017-01-01 11:23:18]
Traceback (most recent call last):
  File "compare_data.py", line 30, in <module>
    compare_array_to_array()
  File "compare_data.py", line 24, in compare_array_to_array
    np_array_01 = read_sample_data_01_array()
  File "compare_data.py", line 9, in read_sample_data_01_array
    np_array_01 = np.array(sample_data_01)
MemoryError

I've had a look at other posts regarding NumPy MemoryErrors, but I'm struggling to understand how the problem is resolved, so I apologize in advance if this question has been asked before.

The full script is as follows:

from datetime import datetime
import numpy as np

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)


def read_sample_data_02_array():
    sample_data_02 = []
    with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
        for line in fi:
            sample_data_02.append(line)
    np_array_02 = np.array(sample_data_02)
    return(np_array_02)


def compare_array_to_array():
    np_array_02 = read_sample_data_02_array()
    np_array_01 = read_sample_data_01_array()
    ct = np.sum(np_array_01 == np_array_02)
    print(ct)


print("process started - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))    
compare_array_to_array()
print("process completed - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))

The workstation this code currently runs on is 32-bit, as is its Python build. That said, Ubuntu can see 8 GB of RAM, although I suspect only 4 GB of that is addressable by a single process. The target workstation has 64 GB of RAM and is a 64-bit system; however, I would like to cater for the lesser machine.
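For a rough sense of the numbers involved, here's a quick back-of-the-envelope estimate (the exact figures depend on the interpreter and platform, so treat it as a sketch rather than a precise measurement):

    import sys
    import numpy as np

    sample = "XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK"  # one 32-character hash string

    # A Python list of 34,000,000 such strings: each str object plus a
    # pointer-sized list slot. The "+ 8" assumes 64-bit pointers; on a
    # 32-bit build the slot is 4 bytes and the per-string overhead is smaller.
    list_estimate = 34000000 * (sys.getsizeof(sample) + 8)

    # A NumPy array with a fixed-width unicode dtype: '<U32' stores 32 code
    # points at 4 bytes each, i.e. 128 bytes per element.
    array_estimate = 34000000 * np.dtype("<U32").itemsize

    print("list  ~ {0:.1f} GiB".format(list_estimate / 1024.0 ** 3))
    print("array ~ {0:.1f} GiB".format(array_estimate / 1024.0 ** 3))

Either structure on its own is already in the multi-gigabyte range, which is tight for a 32-bit process even before the list and the array coexist during conversion.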

Here's an example of the strings contained within the datasets I'm trying to compare:

XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK
FPNXWZTALSZULLWLPCMTGTAKIGMCAMFT
NKIOLIINPXBMUUOLYKWWBCIXCIVKWCPO
DPLIXJCFJOKLSZUSRPPBDBAADXEHEDZL
NGIMMXKXZHIQHTCXRPKGWYPUPJMAJAPQ

Many thanks

  • I have answered but I have a doubt now. Could you post a sample of the input files? Commented Jan 1, 2017 at 12:20
  • Sure, I wrote a script to generate 32-character strings that simulate MD5 hashes as closely as I could; I've added an example to the question above. Commented Jan 1, 2017 at 13:01
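The generator script itself isn't posted in the thread, but as a rough sketch, something like this would produce strings of the same shape as the examples in the question (random uppercase letters rather than real hexadecimal MD5 digests):

    import random
    import string

    def write_fake_hashes(path, count):
        # Hypothetical helper, not the original script: writes `count` random
        # 32-character uppercase strings, one per line.
        with open(path, "w") as fo:
            for _ in range(count):
                fo.write("".join(random.choice(string.ascii_uppercase) for _ in range(32)) + "\n")

    write_fake_hashes("data_set_02.txt", 1000000)
    write_fake_hashes("data_set_01.txt", 34000000)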

1 Answer


With that routine:

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)

you're creating a native Python list first and then building a NumPy array from it. At some point you need roughly double the final amount of memory at once, which may explain why you run out of memory.

You could read the file directly with numpy.loadtxt, but I suspect it would take about the same amount of memory because it computes the data size automatically. So here's my suggestion: use numpy.fromiter and specify the dtype as "<U32", so NumPy allocates a fixed 32 characters per element up front (it cannot work the maximum length out for itself while iterating, which is also why this route avoids building an intermediate copy of the data):

def read_sample_data_array(filepath):
    with open(filepath) as f:
        array = np.fromiter((line.strip() for line in f), dtype="<U32")
    return array

(Note the generator expression (line.strip() for line in f), which, unlike the list comprehension [line.strip() for line in f], doesn't build an intermediate list in memory.)

then:

np_array_02 = read_sample_data_array("data_set_02.txt")
np_array_01 = read_sample_data_array("data_set_01.txt")

However, this still takes too much memory, so let me propose a non-NumPy alternative using only base Python:

with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
    s = {line.strip() for line in fi}

with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
    number_of_matches = sum(1 for line in fi if line.strip() in s)

  1. Read the "small" dataset and put it in a set for fast lookup.
  2. Read the "big" dataset line by line (so it is never held in memory all at once) and count the matches with sum over a conditional generator expression.

That should do the trick and stay relatively fast even if the big dataset grows further. You just don't need the big dataset to be in memory; only the small one has to fit.
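If you need the matching hashes themselves rather than just the count, the same pattern works with a set comprehension in place of sum; a minimal sketch, assuming the same file layout as above:

    with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
        s = {line.strip() for line in fi}

    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        # Collect the matching hashes instead of only counting them.
        matches = {h for h in (line.strip() for line in fi) if h in s}

    print(len(matches))

The result can never be larger than the small set, so it still fits comfortably in memory.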


7 Comments

  • I'm afraid your solution is returning a memory error.
  • Thanks to your edit, I could adapt my solution. It should be better memory-wise. Please try it.
  • I've tried your alteration, and I now receive the error: MemoryError: cannot allocate array memory. The error appears to relate to this line of code: array = np.fromiter((line.strip() for line in f), dtype="<U32")
  • See my edit; I think the last solution will work for you.
  • Spot on! 18 seconds is all it took to conduct the comparison. That's impressive in my book, so thank you very much :)
