
I'm trying to compare two lists of MD5 hashes and identify matches. One of these lists contains approximately 34,000,000 hashes, and the other could contain up to 1,000,000.

Using NumPy I've been experimenting with the time it takes to conduct the comparison versus a standard Python list, and the performance difference is very impressive. The experimental hash datasets I've been using only contain 1,000,000 entries each, but when I try to simulate the target dataset of 34,000,000, the Python script returns the following error:

process started - [2017-01-01 11:23:18]
Traceback (most recent call last):
  File "compare_data.py", line 30, in <module>
    compare_array_to_array()
  File "compare_data.py", line 24, in compare_array_to_array
    np_array_01 = read_sample_data_01_array()
  File "compare_data.py", line 9, in read_sample_data_01_array
    np_array_01 = np.array(sample_data_01)
MemoryError

I've had a look at other posts regarding NumPy MemoryErrors, but I'm struggling to understand how the problem is resolved, so I apologize in advance if this question has been asked before.

The full script is as follows:

from datetime import datetime
import numpy as np

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)


def read_sample_data_02_array():
    sample_data_02 = []
    with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
        for line in fi:
            sample_data_02.append(line)
    np_array_02 = np.array(sample_data_02)
    return(np_array_02)


def compare_array_to_array():
    np_array_02 = read_sample_data_02_array()
    np_array_01 = read_sample_data_01_array()
    ct = np.sum(np_array_01 == np_array_02)
    print(ct)


print("process started - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))    
compare_array_to_array()
print("process completed - [{0}]".format(datetime.now().strftime('%Y-%m-%d %H:%M:%S')))

The workstation this code currently runs on is 32-bit, as is its Python build. That said, Ubuntu can see 8 GB of RAM, although I suspect only 4 GB of that is addressable by a single process. The target workstation has 64 GB of RAM and is a 64-bit system; however, I would like to cater for the lesser machine.
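For a rough sense of the numbers involved, here's a quick back-of-the-envelope estimate (the exact figures depend on the interpreter and platform, so treat it as a sketch rather than a precise measurement):

    import sys
    import numpy as np

    sample = "XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK"  # one 32-character hash string

    # A Python list of 34,000,000 such strings: each str object plus a
    # pointer-sized list slot. The "+ 8" assumes 64-bit pointers; on a
    # 32-bit build the slot is 4 bytes and the per-string overhead is smaller.
    list_estimate = 34000000 * (sys.getsizeof(sample) + 8)

    # A NumPy array with a fixed-width unicode dtype: '<U32' stores 32 code
    # points at 4 bytes each, i.e. 128 bytes per element.
    array_estimate = 34000000 * np.dtype("<U32").itemsize

    print("list  ~ {0:.1f} GiB".format(list_estimate / 1024.0 ** 3))
    print("array ~ {0:.1f} GiB".format(array_estimate / 1024.0 ** 3))

Either structure on its own is already in the multi-gigabyte range, which is tight for a 32-bit process even before the list and the array coexist during conversion.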

Here's an example of the strings contained within the datasets I'm trying to compare:

XDPUXSRCYCYTYEUPQMGXEDPHQTDOCLEK
FPNXWZTALSZULLWLPCMTGTAKIGMCAMFT
NKIOLIINPXBMUUOLYKWWBCIXCIVKWCPO
DPLIXJCFJOKLSZUSRPPBDBAADXEHEDZL
NGIMMXKXZHIQHTCXRPKGWYPUPJMAJAPQ

Many thanks

  • I have answered but I have a doubt now. Could you post a sample of the input files? Commented Jan 1, 2017 at 12:20
  • Sure, I wrote a script to generate 32-character strings that simulate MD5 hashes as closely as I could; I've added an example to the question above. Commented Jan 1, 2017 at 13:01
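The generator script itself isn't posted in the thread, but as a rough sketch, something like this would produce strings of the same shape as the examples in the question (random uppercase letters rather than real hexadecimal MD5 digests):

    import random
    import string

    def write_fake_hashes(path, count):
        # Hypothetical helper, not the original script: writes `count` random
        # 32-character uppercase strings, one per line.
        with open(path, "w") as fo:
            for _ in range(count):
                fo.write("".join(random.choice(string.ascii_uppercase) for _ in range(32)) + "\n")

    write_fake_hashes("data_set_02.txt", 1000000)
    write_fake_hashes("data_set_01.txt", 34000000)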

1 Answer


With that routine:

def read_sample_data_01_array():
    sample_data_01 = []
    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        for line in fi:
            sample_data_01.append(line)
    np_array_01 = np.array(sample_data_01)
    return(np_array_01)

you're creating a native Python list first and then building a NumPy array from it. At some point you need roughly double the final amount of memory at once, which may explain why you run out of memory.

You could read the file directly with numpy.loadtxt, but I suspect it would take about the same amount of memory because it computes the data size automatically. So here's my suggestion: use numpy.fromiter and specify the dtype as "<U32", so NumPy allocates a fixed 32 characters per element up front (it cannot work the maximum length out for itself while iterating, which is also why this route avoids building an intermediate copy of the data):

def read_sample_data_array(filepath):
    with open(filepath) as f:
        array = np.fromiter((line.strip() for line in f), dtype="<U32")
    return array

(Note the generator expression (line.strip() for line in f), which, unlike the list comprehension [line.strip() for line in f], doesn't build an intermediate list in memory.)

then:

np_array_02 = read_sample_data_array("data_set_02.txt")
np_array_01 = read_sample_data_array("data_set_01.txt")

However, this still takes too much memory, so let me propose a non-NumPy alternative using only base Python:

with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
    s = {line.strip() for line in fi}

with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
    number_of_matches = sum(1 for line in fi if line.strip() in s)

  1. Read the "small" dataset and put it in a set for fast lookup.
  2. Read the "big" dataset line by line (so it is never held in memory all at once) and count the matches with sum over a conditional generator expression.

That should do the trick and stay relatively fast even if the big dataset grows further. You just don't need the big dataset to be in memory; only the small one has to fit.
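If you need the matching hashes themselves rather than just the count, the same pattern works with a set comprehension in place of sum; a minimal sketch, assuming the same file layout as above:

    with open("data_set_02.txt", "r") as fi: #1,000,000 hashes
        s = {line.strip() for line in fi}

    with open("data_set_01.txt", "r") as fi: #34,000,000 hashes
        # Collect the matching hashes instead of only counting them.
        matches = {h for h in (line.strip() for line in fi) if h in s}

    print(len(matches))

The result can never be larger than the small set, so it still fits comfortably in memory.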


7 Comments

  • I'm afraid your solution is returning a memory error.
  • Thanks to your edit, I could adapt my solution. It should be better memory-wise. Please try it.
  • I've tried your alteration, and I now receive the error: MemoryError: cannot allocate array memory. The error appears to relate to this line of code: array = np.fromiter((line.strip() for line in f), dtype="<U32")
  • See my edit; I think the last solution will work for you.
  • Spot on! 18 seconds is all it took to conduct the comparison. That's impressive in my book, so thank you very much :)
