Split a long byte array into numpy array of strings

Question

Usually, when creating a NumPy array of strings, we can do something like

>>> import numpy as np
>>> np.array(["Hello world!", "good bye world!", "whatever world"])
array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15')

Now the question is, I am given a long bytearray from a foreign C function like this:

b'Hello world!\x00<some rubbish bytes>good bye world!\x00<some rubbish bytes>whatever world\x00<some rubbish bytes>'

It is guaranteed that every 32-byte slot holds one null-terminated string (i.e., a \x00 byte is appended to the valid part of the string). I need to convert this long bytearray to something like array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15'), preferably in place (i.e., without copying memory).

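This is what I do now:

str_arr = [None] * str_count
for i in range(str_count):
    # Take the i-th 32-byte slot and keep only the part before the first null byte.
    str_arr[i] = byte_arr[i * 32: (i+1) * 32].split(b'\x00')[0].decode('utf-8')
str_arr_np = np.array(str_arr)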

It works, but it is kind of awkward and not done in-place (bytes are copied at least once, if not twice). Are there any better approaches?

Can you please give a real example input with the real, expected output? How are we supposed to distinguish "rubbish bytes"?
– juanpa.arrivillaga, Commented Oct 19, 2022 at 4:29
So every 32 bytes there is the beginning of some null-terminated string which can potentially take up to 32 bytes (but possibly less)? Again, please just give a real example.
– juanpa.arrivillaga, Commented Oct 19, 2022 at 4:34
Do you have access to the C side of the transform? It's easy enough to null-out the 32 byte buffer(s) (in C) before filling each with up to 31 characters (plus at least one terminating '\0')...
– user17592432, Commented Oct 19, 2022 at 4:42
Hey @Fe2O3, based on your hint another user did provide a better approach! In my initial design I didn't know how zeroing could help, so I left the memory contents indeterminate. But the buffer can be memset() to zero before use.
– D.J. Elkind, Commented Oct 19, 2022 at 5:00
@Fe2O3 it won't, but it assumes that the buffer size is a multiple of the item size (and will complain if it isn't).
– juanpa.arrivillaga, Commented Oct 19, 2022 at 5:11

juanpa.arrivillaga · Accepted Answer · 2022-10-19 05:09:58Z

If you can zero out the data on the C side, then you can use np.frombuffer and it will be about as efficient as you can reasonably expect:

>>> raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(raw, dtype='S16')
array([b'hello world', b'Good Bye'], dtype='|S16')

Of course, this gives you an array of bytes strings (dtype 'S16'), not Unicode strings, although that may be desirable in your case.
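If Unicode strings are needed, one possible follow-up step is np.char.decode. This is only a sketch, assuming the data is valid UTF-8; unlike np.frombuffer, the decoding step allocates a new array, so it is not zero-copy. The names byte_arr and str_arr are illustrative.

```python
import numpy as np

raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'

# View the buffer as fixed-width 16-byte strings (no copy of the data).
byte_arr = np.frombuffer(raw, dtype='S16')

# Decode every element; this builds a new unicode array, so it does copy.
str_arr = np.char.decode(byte_arr, 'utf-8')
print(str_arr)
```

Keeping the 'S16' array and decoding only when a value is actually displayed may be preferable if avoiding copies matters more than having a 'U' dtype.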

Note that the 'S16' result relies on NumPy's built-in behavior of stripping trailing null bytes from each element. If there is garbage after the terminator, it won't work:

>>> data = b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(data, dtype='S16')
array([b'hello world\x00aaaa', b'Good Bye'], dtype='|S16')
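If the buffer cannot be zeroed at the source, one possible workaround is to blank out everything from the first null byte onward in each record before viewing it as fixed-width strings. The sketch below is an assumption on my part rather than part of the approach above; the helper name strip_after_null is hypothetical, and it makes one copy of the data.

```python
import numpy as np

def strip_after_null(buf, record_size):
    """Return an 'S<record_size>' array with bytes after the first null zeroed.

    Hypothetical helper for illustration; it copies the data once.
    """
    # One row per record, one column per byte (this is still a view).
    rows = np.frombuffer(buf, dtype=np.uint8).reshape(-1, record_size)
    # True at the first null byte of each row and at every position after it.
    after_null = np.logical_or.accumulate(rows == 0, axis=1)
    cleaned = rows.copy()      # writable, C-contiguous copy
    cleaned[after_null] = 0    # overwrite the garbage with null bytes
    # Reinterpret each row as a single fixed-width bytes string.
    return cleaned.view(f"S{record_size}").ravel()

data = b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00'
print(strip_after_null(data, 16))
```

When the buffer can be zeroed on the C side, the plain np.frombuffer call avoids this extra copy.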

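Note that np.frombuffer shouldn't make a copy. Notice that the resulting array is read-only, because it is backed by the immutable bytes object: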

>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr
array([b'hello world', b'Good Bye'], dtype='|S16')
>>> arr[0] = b"z"*16
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only

However, if the underlying buffer is writable, say you had a bytearray to begin with, then writes to the array show up in the buffer as well:

>>> raw = bytearray(raw)
>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr[0] = b"z"*16
>>> arr
array([b'zzzzzzzzzzzzzzzz', b'Good Bye'], dtype='|S16')
>>> raw
bytearray(b'zzzzzzzzzzzzzzzzGood Bye\x00\x00\x00\x00\x00\x00\x00\x00')
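The same idea can be sketched for the 32-byte layout from the question. Here a zero-initialised ctypes buffer stands in for the memory filled by the C function; that stand-in, and the names RECORD_SIZE, words and buf, are assumptions for illustration only.

```python
import ctypes
import numpy as np

RECORD_SIZE = 32  # bytes per slot, as described in the question
words = [b"Hello world!", b"good bye world!", b"whatever world"]

# create_string_buffer returns zero-filled memory, so every slot is null-padded.
buf = ctypes.create_string_buffer(RECORD_SIZE * len(words))
for i, word in enumerate(words):
    start = i * RECORD_SIZE
    buf[start:start + len(word)] = word

# View the buffer as fixed-width byte strings without copying it.
arr = np.frombuffer(buf, dtype=f"S{RECORD_SIZE}")
print(arr)

# A write to the buffer is visible through the array, which suggests shared memory.
buf[0:1] = b"J"
print(arr[0])
```

Because the array only borrows the memory, the buffer has to stay alive and unchanged on the C side for as long as the array is in use.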

edited Oct 19, 2022 at 5:09

answered Oct 19, 2022 at 4:50

juanpa.arrivillaga


6 Comments

D.J. Elkind Over a year ago

Let's assume there aren't any rubbish bytes and all unused bytes are \x00. Is it possible to use something like np.ctypeslib.as_array(int_ptr, shape=(int_count,))? The advantage is that we use the memory as it is, without a single copy operation. This works for integers. The tricky part is that I need to let NumPy know that each element is 16 bytes long.

juanpa.arrivillaga Over a year ago

@D.J.Elkind hmmm, I hesitate to suggest this, but I believe using the np.ndarray constructor directly (which it is not meant for), np.ndarray(shape=(2,), buffer=raw, dtype='S16'), would use the underlying buffer without copying it.

juanpa.arrivillaga Over a year ago

@D.J.Elkind although, actually, np.frombuffer is not making a copy of the underlying data, I believe. Indeed, try to assign to the array created from a bytes object and it complains that you are trying to write to a read-only destination. If you used a bytearray instead, you see the changes in both!

D.J. Elkind Over a year ago

Why don't you suggest this? Does NumPy's documentation say it is inadvisable, or may it risk leaking memory? In my case, the memory pointed to by byte_arr is prepared by the C program solely for use by NumPy, so it won't become corrupted later because other functions modify it.

D.J. Elkind Over a year ago

"actually, np.frombuffer is not making a copy of the underlying data" -> oh really, let me try as well
