I'm trying to compare two lists of MD5 hashes and identify matches. One of these lists contains approximately 34,000,000 hashes, and the other could contain up to 1,000,000.
Using NumPy I've been experimenting with the time it takes to conduct the comparison versus a standard Python list, and the performance difference is very impressive. The experimental hash datasets I've been using contain only 1,000,000 entries each, but when I try to simulate the target dataset of 34,000,000, the script returns the following error:
process started - [2017-01-01 11:23:18]
Traceback (most recent call last):
  File "compare_data.py", line 30, in <module>
    compare_array_to_array()
  File "compare_data.py", line 24, in compare_array_to_array
    np_array_01 = read_sample_data_01_array()
  File "compare_data.py", line 9, in read_sample_data_01_array
    np_array_01 = np.array(sample_data_01)
MemoryError
I've had a look at other posts regarding NumPy MemoryErrors, but I'm struggling to understand how the problem is resolved, so I apologize in advance if this question has been asked before.
The full script is as follows:
from datetime import datetime
import numpy as np

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi:  # 34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return np_array_01

def read_sample_data_02_array():
    sample_data_02 = []
    with open("data_set_02.txt", "r") as fi:  # 1,000,000 hashes
        for line in fi:
            sample_data_02.append(line)
    np_array_02 = np.array(sample_data_02)
    return np_array_02

def compare_array_to_array():
    np_array_02 = read_sample_data_02_array()
    np_array_01 = read_sample_data_01_array()
    ct = np.sum(np_array_01 == np_array_02)
    print(ct)

print("process started - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
compare_array_to_array()
print("process completed - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
The current workstation this code runs on is 32-bit, as is its Python, and although Ubuntu can see 8 GB of RAM, I suspect only 4 GB of that is addressable per process. The target workstation is a 64-bit system with 64 GB of RAM, but I would like to cater for the lesser system.
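For context, here is a minimal sketch of a more memory-frugal loading strategy I could fall back on, assuming every hash is exactly 32 characters, one per line (`read_hashes` and `count_matches` are made-up names). Storing each hash as a fixed-width 32-byte string means 34,000,000 hashes occupy roughly 1.1 GB in the final array, far less than an array of Unicode strings built from a list of Python `str` objects:

```python
import numpy as np

def read_hashes(path):
    # Read as bytes and strip the newline; the resulting array stores
    # each hash in exactly 32 bytes (dtype "S32") instead of a Python
    # string object, shrinking the resident array considerably.
    with open(path, "rb") as fi:
        return np.array([line.strip() for line in fi], dtype="S32")

def count_matches(big, small):
    # np.isin tests each element of `big` for membership in `small`
    # using a sort-based algorithm, so there is never a huge pairwise
    # comparison matrix in memory.
    return int(np.isin(big, small).sum())
```

I don't know whether even 1.1 GB fits comfortably in a 32-bit process alongside the temporary list, so this may only help on the 64-bit box.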
Here's an example of the strings contained within the datasets I'm trying to compare:
XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK
FPNXWZTALSZULLWLPCMTGTAKIGMCAMFT
NKIOLIINPXBMUUOLYKWWBCIXCIVKWCPO
DPLIXJCFJOKLSZUSRPPBDBAADXEHEDZL
NGIMMXKXZHIQHTCXRPKGWYPUPJMAJAPQ
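In case the 32-bit box simply cannot hold both arrays, a pure-Python fallback that keeps only the smaller list in memory would look something like this (a sketch; `count_matches_streaming` is a hypothetical name). It builds a set from the 1,000,000-hash file and streams the 34,000,000-hash file line by line, so peak memory is independent of the larger file's size:

```python
def count_matches_streaming(big_path, small_path):
    # Hold only the smaller dataset in memory as a set (O(1) lookups).
    with open(small_path) as fi:
        small = {line.strip() for line in fi}
    # Stream the larger file one line at a time; nothing from it is
    # retained after the membership test.
    ct = 0
    with open(big_path) as fi:
        for line in fi:
            if line.strip() in small:
                ct += 1
    return ct
```

It's slower than the NumPy version but should run in well under 4 GB.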
Many thanks