I have a Python script that processes a large amount of data from compressed ASCII. After a short period it runs out of memory, even though I am not building up large lists or dicts. The following code illustrates the issue:
import struct
import zlib
import binascii
import numpy as np
import psutil
import os
import gc
process = psutil.Process(os.getpid())
n = 1000000
compressed_data = binascii.b2a_base64(bytearray(zlib.compress(struct.pack('%dB' % n, *np.random.random(n))))).rstrip()
print 'Memory before entering the loop is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % (len(byte_array)), byte_array))
    gc.collect()
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
It prints:
Memory before entering the loop is 45 MB
Memory before iteration 0 is 45 MB
Memory before iteration 1 is 51 MB
Memory after last iteration is 51 MB
Between the first and second iterations, memory usage grows by 6 MB. If I run the loop more than two times, the memory usage stays at 51 MB. If I put the decompression code into its own function and feed it the actual compressed data, the memory usage continues to grow. I am using Python 2.7. Why is the memory increasing, and how can it be corrected? Thank you.
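For illustration, here is a minimal sketch of the function-based variant mentioned above. The function name decompress_chunk is my own and not from the original script; everything else reuses the setup from the code at the top of the question.

def decompress_chunk(data):
    # Base64-decode, decompress and unpack one chunk into a numpy array.
    byte_array = zlib.decompress(binascii.a2b_base64(data))
    return np.array(struct.unpack('%dB' % len(byte_array), byte_array))

for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    a = decompress_chunk(compressed_data)
    gc.collect()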
What about byte_array and the a = np.array? Your first iteration outputs the memory usage before instantiating them. That sounds like a lot of data, which is likely not to be destroyed by the garbage collector because you call it within the for loop scope. Unindent (move left) that gc.collect() so it runs outside the for loop, and see what happens.

Even outside the for loop, byte_array and a are still in your scope (my bad, they don't get destroyed). Right after the loop ends (and before your second gc.collect() that you just added) do byte_array = None and a = None ... Now I'm curious myself :-)
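As a sketch of that suggestion, assuming the same imports and setup as in the question:

for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % len(byte_array), byte_array))

# Drop the last references the loop left behind, then collect outside the loop.
byte_array = None
a = None
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))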