Python Memory Leak Using binascii, zlib, struct, and numpy

Question

I have a Python script that processes a large amount of data from compressed ASCII. After a short period, it runs out of memory. I am not constructing large lists or dicts. The following code illustrates the issue:

import struct
import zlib
import binascii
import numpy as np
import psutil
import os
import gc

process = psutil.Process(os.getpid())
n = 1000000
compressed_data = binascii.b2a_base64(bytearray(zlib.compress(struct.pack('%dB' % n, *np.random.random(n))))).rstrip()

print 'Memory before entering the loop is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % (len(byte_array)), byte_array))
    gc.collect()
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))

It prints:

Memory before entering the loop is 45 MB
Memory before iteration 0 is 45 MB
Memory before iteration 1 is 51 MB
Memory after last iteration is 51 MB

Between the first and second iterations, memory usage grows by 6 MB. If I run the loop more than two times, the memory usage stays at 51 MB. If I put the decompression code into its own function and feed it the actual compressed data, the memory usage continues to grow. I am using Python 2.7. Why is the memory increasing, and how can it be corrected? Thank you.

I wouldn't say that is a memory leak; it is normal memory consumption.
– Daniel, Commented Dec 1, 2014 at 20:47
Besides looking quite normal, as @Daniel said, how about byte_array and a = np.array? Your first iteration prints the memory usage before instantiating them. That sounds like a lot of data, which is likely not being destroyed by the garbage collector because you call it within the for loop scope. Unindent (move left) that gc.collect() so it runs outside the for loop, and see what happens.
– Savir, Commented Dec 1, 2014 at 20:49
@BorrajaX I added another gc.collect() before the last print and after the loop exits, no change. For all the print statements, the byte_array and a variables shouldn't exist in memory.
– user2133814, Commented Dec 1, 2014 at 20:55
Sorry, sorry... Even after the for loop, byte_array and a are in your scope (my bad, they don't get destroyed). Right after the loop ends (and before the second gc.collect() that you just added), set byte_array = None and a = None... Now I'm curious myself :-)
– Savir, Commented Dec 1, 2014 at 20:57
@BorrajaX I added those set-to-None statements and it cleared the memory, fixing the concern I had. I misunderstood Python scoping; I'm more used to Java. Anyway, I still have an issue in my code, but the above example doesn't correctly show it. Thanks
– user2133814, Commented Dec 1, 2014 at 21:06

Community · Accepted Answer · 2017-05-23 12:20:32Z

Through comments, we figured out what was going on:

The main issue is that variables assigned in a for loop are not destroyed once the loop ends (a for loop does not create a new scope in Python). They remain accessible, bound to the value they received in the last iteration:

>>> for i in range(5):
...     a=i
...
>>> print a
4

So here's what's happening:

First iteration: The print shows 45 MB, which is the memory in use before byte_array and a are instantiated.
The code then instantiates those two large objects, taking the memory to 51 MB.
Second iteration: The two objects created in the first run of the loop are still there.
In the middle of the second iteration, byte_array and a are rebound to newly created objects. The original ones are destroyed but replaced by equally large ones.
The for loop ends, but byte_array and a are still accessible in the code and therefore not destroyed by the second gc.collect() call.
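
The sequence above can be observed in a minimal sketch. It assumes CPython, where an object is released as soon as its last reference goes away. Payload is a hypothetical stand-in for the large byte_array and a objects, and weakref is used here only to watch when each object is freed without keeping it alive:

```python
import weakref


class Payload(object):
    """Hypothetical stand-in for a large buffer such as byte_array."""

    def __init__(self, label):
        self.label = label


refs = []
for i in range(2):
    data = Payload(i)                # rebinding 'data' drops the previous object
    refs.append(weakref.ref(data))   # a weak reference does not keep it alive

# After the loop, 'data' still names the object from the last iteration.
first_alive = refs[0]() is not None   # object from iteration 0
last_alive = refs[1]() is not None    # object from iteration 1

data = None                           # drop the last remaining reference
last_alive_after_reset = refs[1]() is not None

print(first_alive, last_alive, last_alive_after_reset)
```

Under CPython the first object is already gone once the loop finishes, the last one survives until data is rebound, and no gc.collect() call is needed for either release. Other interpreters may free the objects later.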

Changing the code to:

for i in xrange(2):
   [ . . . ]
byte_array = None
a = None
gc.collect()

made the memory reserved by byte_array and a inaccessible, and therefore freed.
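
An alternative to resetting the names by hand is to do the decoding inside a function, so the large intermediates are local names that disappear when the function returns. The following is only a sketch under the assumption that a small result (here a sum) is all that is needed from each buffer. make_payload and checksum are hypothetical helper names, and numpy is left out to keep the example to the standard library:

```python
import binascii
import struct
import zlib


def make_payload(n):
    """Build base64-encoded, zlib-compressed test data of n bytes (illustrative only)."""
    raw = struct.pack('%dB' % n, *([7] * n))
    return binascii.b2a_base64(zlib.compress(raw)).rstrip()


def checksum(compressed_data):
    """Decode one payload and return only a small summary value."""
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    values = struct.unpack('%dB' % len(byte_array), byte_array)
    # byte_array and values are locals: nothing refers to them after return.
    return sum(values)


payload = make_payload(1000)
totals = [checksum(payload) for _ in range(2)]
print(totals)
```

If the function returned the decoded array itself, the caller would again hold a reference to it, and the same reasoning as in the loop would apply to whatever name receives it.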

There's more on Python's garbage collection in this SO answer: https://stackoverflow.com/a/4484312/289011

Also, it may be worth looking at How do I determine the size of an object in Python?. This is tricky, though... if your object is a list pointing to other objects, what is the size? The sum of the pointers in the list? The sum of the size of the objects those pointers point to?
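
To make the shallow-versus-deep distinction concrete, here is a rough sketch using sys.getsizeof. deep_size is a hypothetical helper written for this illustration, not a standard library function. It only follows the built-in containers listed in the code, so it should be read as an approximation:

```python
import sys


def deep_size(obj, seen=None):
    """Approximate size in bytes of obj plus the objects it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # count shared objects only once
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)    # shallow size: the container itself
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    return size


rows = [list(range(100)) for _ in range(10)]
shallow = sys.getsizeof(rows)    # the outer list and its pointers only
deep = deep_size(rows)           # also the inner lists and their integers

print(shallow, deep)
```

The exact numbers depend on the interpreter version and platform, but the deep figure is always the larger one here because the shallow size ignores everything the outer list points to.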
