
I have a Python Lambda that's getting close to its memory limit, and I'm not comfortable with that operationally.

Essentially the Lambda reads a bunch of bytes, does some data examination to throw some of it out, decodes to UTF-8, and then ultimately indexes into ES. Some pseudocode:

bytes = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body'].read()
bytes = find_subset_of_bytes(bytes)
for line in bytes.decode('utf-8').split():
    # do stuff w/ line

My guess is that one optimization I can make is to not decode the entire byte string at once, but only parts at a time, since decoding the whole thing essentially doubles the memory footprint.

Will memory usage improve if I do something like:

for byte_line in bytes.split('\n'.encode('utf-8')):
    line = byte_line.decode('utf-8')
    # do stuff w/ line

But is the split on bytes effective? Will it give me a lazy stream of lines, or does it build the whole thing at once?

2 Answers

As per the docs, split returns a list, not a generator. You could, however, read one byte at a time and maintain your own line buffer, something like:

def get_lines_buffer(bytes_):
    buff = bytearray()
    for b in bytes_:
        # iterating over bytes yields ints in Python 3, so compare
        # against the integer code of '\n'
        if b == ord('\n'):
            yield buff.decode('utf-8')
            buff = bytearray()
        else:
            buff.append(b)
    if buff:
        yield buff.decode('utf-8')  # yield remaining buffer


for line in get_lines_buffer(b'123\n456\n789'):
    print(line)

Or here's the find-based approach you asked about:

def get_lines_find(bytes_):
    # note: like bytes.split, this yields a trailing empty string
    # if the data ends with b'\n'
    a, b = 0, 0
    while b < len(bytes_):
        b = bytes_.find(b'\n', a)  # C-level scan for the next newline
        if b == -1:
            b = len(bytes_)  # no further matches; take the rest
        s = bytes_[a:b]
        a = b + 1
        yield s.decode('utf-8')

for line in get_lines_find(b'123\n456\n789'):
    print(line)

Comparing the two:

data = b'123\n456\n789\n' * int(1e5)


def test_buffer():
    for _ in get_lines_buffer(data):
        pass


def test_find():
    for _ in get_lines_find(data):
        pass


if __name__ == '__main__':
    import timeit

    time_buffer = timeit.timeit(
        "test_buffer()",
        setup="from __main__ import test_buffer",
        number=5)
    print(f'buffer method: {time_buffer:.3f}s')

    time_find = timeit.timeit(
        "test_find()",
        setup="from __main__ import test_find",
        number=5)
    print(f'find method: {time_find:.3f}s')

Performance seems to be a bit slower with the "find" method:

buffer method: 8.027s
find method: 10.370s
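
(With lines this short, the per-line slicing overhead in the find method can outweigh its C-level newline scan; results may well differ for longer lines.)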

Also note that bytes is a built-in name, so you shouldn't shadow it with a variable.
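
For completeness, here's a sketch wiring the generator into the flow from your question (s3_resource, bucket, key, some_byte_range and find_subset_of_bytes are the placeholders from your pseudocode; process stands in for your per-line work):

data = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body'].read()
data = find_subset_of_bytes(data)
for line in get_lines_buffer(data):
    process(line)  # do stuff w/ line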


4 Comments

Thanks for your solution. Could I ask how the performance compares to using find to locate the indices where \n occurs and then slicing?
Also, running this code yields ints in the for loop, not the bytes themselves; I think the comparison needs to account for that.
Updated my answer for Python 3, and testing your find suggestion
Errm, I got the find logic wrong, updated yet again

I tested your idea with memory_profiler:

from memory_profiler import profile

byte_list = b"some bytes\n" * 100000


@profile
def decode_split():
    for line in byte_list.decode().split():
        pass


@profile
def split_encode():
    for line in byte_list.split("\n".encode()):
        pass


decode_split()
split_encode()

Output:

Line #    Mem usage    Increment   Line Contents
================================================
     6     12.1 MiB     12.1 MiB   @profile
     7                             def decode_split():
     8     37.3 MiB     25.2 MiB       for line in byte_list.decode().split():
     9     37.3 MiB      0.0 MiB           pass

Line #    Mem usage    Increment   Line Contents
================================================
    12     17.2 MiB     17.2 MiB   @profile
    13                             def split_encode():
    14     20.7 MiB      3.5 MiB       for line in byte_list.split("\n".encode()):
    15     20.7 MiB      0.0 MiB           pass

So yes, encoding the delimiter instead of decoding the bytes saves memory and might be good enough for your purposes.
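
One caveat: bytes.split still builds the whole list of line slices up front, so the saving here comes from skipping the big decoded str, not from streaming. If you also want to avoid holding the raw bytes from read(), recent botocore versions expose iter_lines() on the StreamingBody returned by get(). A sketch, assuming your botocore has iter_lines and that the find_subset_of_bytes filtering from the question can be done per line:

body = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body']
for byte_line in body.iter_lines():  # streams the body in chunks, yields lines as bytes
    line = byte_line.decode('utf-8')  # do stuff w/ line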
