I have a Python Lambda that's getting close to its memory limit, and I'm not comfortable with that operationally.
Essentially the Lambda reads a bunch of bytes, examines the data to throw some of it out, decodes to UTF-8, and then ultimately indexes into ES. Some pseudo code:
data = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body'].read()
data = find_subset_of_bytes(data)
for line in data.decode('utf-8').splitlines():
    # do stuff w/ line
My guess is that one optimization is to not decode the entire byte string up front but only parts of it at a time, since decoding the whole thing essentially doubles the memory footprint (the original bytes and the decoded str are both alive at once).
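For what it's worth, this is roughly how I've been eyeballing the cost of the decode step in isolation (a minimal sketch, not my actual handler; measure_decode is just a throwaway helper):

import tracemalloc

def measure_decode(data: bytes) -> None:
    # Only allocations made after start() are traced, so the peak is
    # roughly the size of the temporary str produced by decode().
    tracemalloc.start()
    text = data.decode('utf-8')
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f'input: {len(data) / 1e6:.1f} MB, '
          f'decoded: {len(text)} chars, peak during decode: {peak / 1e6:.1f} MB')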
Will memory improve if I do something like this instead?
for byte_line in data.split(b'\n'):
    line = byte_line.decode('utf-8')
    # do stuff w/ line
But is the split on the raw bytes actually effective? Does it give me a lazy, stream-like iterator, or does it build the whole list of byte lines at once?
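What I was really hoping for is something lazy, so only one decoded line is alive at a time. A minimal sketch of the kind of generator I have in mind (iter_decoded_lines is just an illustrative name, not something I'm using yet):

def iter_decoded_lines(data: bytes):
    # Walk the buffer with find() and yield one decoded line at a time,
    # instead of materializing a full list of byte lines.
    start = 0
    while start < len(data):
        end = data.find(b'\n', start)
        if end == -1:
            yield data[start:].decode('utf-8')
            return
        yield data[start:end].decode('utf-8')
        start = end + 1

Splitting on b'\n' before decoding should be safe for UTF-8, since a newline byte can never appear inside a multi-byte sequence.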
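The other thing I've wondered about is skipping the up-front read() entirely and streaming from the response body, something like the sketch below. boto3's StreamingBody does expose iter_lines(), but I'm not sure it applies here since find_subset_of_bytes currently needs the whole blob; this assumes it could be reworked into a per-line filter (keep_line is hypothetical):

body = s3_resource.Object(bucket, key).get(Range=some_byte_range)['Body']
for byte_line in body.iter_lines():   # yields bytes, one line at a time
    if not keep_line(byte_line):      # hypothetical per-line version of my filter
        continue
    line = byte_line.decode('utf-8')
    # do stuff w/ line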