
I have a Python Lambda that's getting close to its memory limit, and I'm not comfortable with that operationally.

Essentially the Lambda reads a bunch of bytes, does some data examination to throw some of it out, decodes to UTF-8, and then ultimately indexes into ES. Some pseudocode:

bytes = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body'].read()
bytes = find_subset_of_bytes(bytes)
for line in bytes.decode('utf-8').split():
    # do stuff w/ line

My guess is that one optimization I can make is to not decode the entire byte string at once, but only parts at a time, since decoding the whole thing essentially doubles the memory footprint.

Will memory usage improve if I do something like:

for byte_line in bytes.split('\n'.encode('utf-8')):
    line = byte_line.decode('utf-8')
    # do stuff w/ line

But is the split on bytes effective? Will it give me a lazy stream of lines, or does it build the whole thing at once?

2 Answers

As per the docs, split returns a list, not a generator. You could, however, read one byte at a time and maintain your own line buffer, something like:

def get_lines_buffer(bytes_):
    buff = bytearray()
    for b in bytes_:
        # iterating over bytes yields ints in Python 3, so compare
        # against the integer code of '\n'
        if b == ord('\n'):
            yield buff.decode('utf-8')
            buff = bytearray()
        else:
            buff.append(b)
    if buff:
        yield buff.decode('utf-8')  # yield remaining buffer


for line in get_lines_buffer(b'123\n456\n789'):
    print(line)

Or here's the find-based approach you asked about:

def get_lines_find(bytes_):
    # note: like bytes.split, this yields a trailing empty string
    # if the data ends with b'\n'
    a, b = 0, 0
    while b < len(bytes_):
        b = bytes_.find(b'\n', a)  # C-level scan for the next newline
        if b == -1:
            b = len(bytes_)  # no further matches; take the rest
        s = bytes_[a:b]
        a = b + 1
        yield s.decode('utf-8')

for line in get_lines_find(b'123\n456\n789'):
    print(line)

Comparing the two:

data = b'123\n456\n789\n' * int(1e5)


def test_buffer():
    for _ in get_lines_buffer(data):
        pass


def test_find():
    for _ in get_lines_find(data):
        pass


if __name__ == '__main__':
    import timeit

    time_buffer = timeit.timeit(
        "test_buffer()",
        setup="from __main__ import test_buffer",
        number=5)
    print(f'buffer method: {time_buffer:.3f}s')

    time_find = timeit.timeit(
        "test_find()",
        setup="from __main__ import test_find",
        number=5)
    print(f'find method: {time_find:.3f}s')

Performance seems to be a bit slower with the "find" method:

buffer method: 8.027s
find method: 10.370s
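
(With lines this short, the per-line slicing overhead in the find method can outweigh its C-level newline scan; results may well differ for longer lines.)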

Also note that bytes is a built-in name, so you shouldn't shadow it with a variable.
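
For completeness, here's a sketch wiring the generator into the flow from your question (s3_resource, bucket, key, some_byte_range and find_subset_of_bytes are the placeholders from your pseudocode; process stands in for your per-line work):

data = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body'].read()
data = find_subset_of_bytes(data)
for line in get_lines_buffer(data):
    process(line)  # do stuff w/ line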


4 Comments

Thanks for your solution. Could I ask how the performance compares to using find to locate the indices where \n occurs and then slicing?
Also, running this code yields ints in the for loop, not the bytes themselves; I think the comparison needs to account for that.
Updated my answer for Python 3, and testing your find suggestion
Errm, I got the find logic wrong, updated yet again

I tested your idea with memory_profiler:

from memory_profiler import profile

byte_list = b"some bytes\n" * 100000


@profile
def decode_split():
    for line in byte_list.decode().split():
        pass


@profile
def split_encode():
    for line in byte_list.split("\n".encode()):
        pass


decode_split()
split_encode()

Output:

Line #    Mem usage    Increment   Line Contents
================================================
     6     12.1 MiB     12.1 MiB   @profile
     7                             def decode_split():
     8     37.3 MiB     25.2 MiB       for line in byte_list.decode().split():
     9     37.3 MiB      0.0 MiB           pass

Line #    Mem usage    Increment   Line Contents
================================================
    12     17.2 MiB     17.2 MiB   @profile
    13                             def split_encode():
    14     20.7 MiB      3.5 MiB       for line in byte_list.split("\n".encode()):
    15     20.7 MiB      0.0 MiB           pass

So yes, encoding the delimiter instead of decoding the bytes saves memory and might be good enough for your purposes.
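
One caveat: bytes.split still builds the whole list of line slices up front, so the saving here comes from skipping the big decoded str, not from streaming. If you also want to avoid holding the raw bytes from read(), recent botocore versions expose iter_lines() on the StreamingBody returned by get(). A sketch, assuming your botocore has iter_lines and that the find_subset_of_bytes filtering from the question can be done per line:

body = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body']
for byte_line in body.iter_lines():  # streams the body in chunks, yields lines as bytes
    line = byte_line.decode('utf-8')  # do stuff w/ line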
