Usually, when creating a NumPy array of strings, we can do something like

>>> import numpy as np
>>> np.array(["Hello world!", "good bye world!", "whatever world"])
array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15')

My question: I am given a long bytearray from a foreign C function like this:

b'Hello world!\x00<some rubbish bytes>good bye world!\x00<some rubbish bytes>whatever world\x00<some rubbish bytes>'

The buffer is divided into 32-byte slots, and each slot is guaranteed to hold a null-terminated string (i.e., a \x00 byte follows the valid part of the string). I need to convert this long bytearray to something like array(['Hello world!', 'good bye world!', 'whatever world'], dtype='<U15'), preferably in place (i.e., without copying memory).

This is what I do now:

str_arr = [None] * str_count
for i in range(str_count):
    # Take the i-th 32-byte slot, cut at the first null byte, then decode
    str_arr[i] = byte_arr[i * 32:(i + 1) * 32].split(b'\x00')[0].decode('utf-8')
str_arr_np = np.array(str_arr)

It works, but it is awkward and not done in place (the bytes are copied at least once, if not twice). Are there any better approaches?

  • Can you please give a real example input with the real, expected output? How are we supposed to distinguish "rubbish bytes"? Commented Oct 19, 2022 at 4:29
  • So every 32 bytes there is the beginning of some null-terminated string which can potentially take up to 32 bytes (but possibly less)? Again, please just give a real example. Commented Oct 19, 2022 at 4:34
  • Do you have access to the C side of the transform? It's easy enough to null-out the 32 byte buffer(s) (in C) before filling each with up to 31 characters (plus at least one terminating '\0')... Commented Oct 19, 2022 at 4:42
  • Hey @Fe2O3, based on your hint another user has provided a better approach! In my initial design I didn't know how it could help, so I left the memory content indeterminate. But it can be zeroed with memset() before use. Commented Oct 19, 2022 at 5:00
  • @Fe2O3 it won't, but it assumes that the buffer size is a multiple of the item size (and will complain if it isn't). Commented Oct 19, 2022 at 5:11

1 Answer

If you can zero out the data on the C side, then it can be read with np.frombuffer, which will probably be about as efficient as you can reasonably expect:

>>> raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(raw, dtype='S16')
array([b'hello world', b'Good Bye'], dtype='|S16')

Of course, this gives you an array of byte strings, not unicode strings, although that may be desirable in your case.
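
If unicode strings are needed after all, one possible follow-up step is np.char.decode. This is a minimal sketch, assuming the payload is valid UTF-8 (the encoding used in the question); the names byte_arr and str_arr are only illustrative. Unlike the frombuffer view, the decoding step allocates a new array, since a unicode dtype stores its characters differently from the raw bytes.

```python
import numpy as np

# Two zero-padded 16-byte slots, as in the example above.
raw = b'hello world\x00\x00\x00\x00\x00Good Bye\x00\x00\x00\x00\x00\x00\x00\x00'

# Zero-copy view over the buffer as fixed-width byte strings.
byte_arr = np.frombuffer(raw, dtype='S16')

# Decode element-wise into a unicode array (this step makes a copy).
str_arr = np.char.decode(byte_arr, 'utf-8')
print(str_arr)
```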

Note that the np.frombuffer approach relies on NumPy's built-in behavior of stripping trailing null bytes from fixed-width byte strings. If there is garbage after the terminator, it won't work:

>>> data = b'hello world\x00aaaaGood Bye\x00\x00\x00\x00\x00\x00\x00\x00'
>>> np.frombuffer(data, dtype='S16')
array([b'hello world\x00aaaa', b'Good Bye'], dtype='|S16')
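
If the buffer cannot be zeroed on the C side, a vectorized cleanup on the Python side is one possible workaround. The sketch below is not part of the original answer and is not zero-copy: it views the buffer as rows of bytes, blanks everything from the first null byte onward in each row (which copies the data once), and then reinterprets each row as a fixed-width byte string. The function name decode_fixed_slots and the parameter slot_size are hypothetical.

```python
import numpy as np

def decode_fixed_slots(buf, slot_size=32):
    """Return an 'S<slot_size>' array, ignoring garbage after the first NUL.

    Hypothetical helper; assumes len(buf) is a multiple of slot_size.
    """
    # View the buffer as a (n_slots, slot_size) matrix of bytes (no copy yet).
    rows = np.frombuffer(buf, dtype=np.uint8).reshape(-1, slot_size)

    # Position of the first NUL in each row, or slot_size if there is none.
    is_nul = rows == 0
    first_nul = np.where(is_nul.any(axis=1), is_nul.argmax(axis=1), slot_size)

    # Keep bytes before the first NUL and zero the rest (this makes a copy).
    keep = np.arange(slot_size) < first_nul[:, None]
    cleaned = np.where(keep, rows, 0).astype(np.uint8)

    # Reinterpret every row as one fixed-width byte string.
    return cleaned.view(f'S{slot_size}').ravel()

# Two 32-byte slots with garbage after each terminator.
buf = (b'Hello world!\x00' + b'\xff' * 19 +
       b'good bye world!\x00' + b'x' * 16)
print(decode_fixed_slots(buf))
```

The result can then be passed to np.char.decode if a unicode array is required.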

np.frombuffer itself should not make a copy. With an immutable bytes object as the source, the resulting array is read-only:

>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr
array([b'hello world', b'Good Bye'], dtype='|S16')
>>> arr[0] = b"z"*16
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: assignment destination is read-only

However, if the source buffer is writable, say you had a bytearray to begin with, then writes through the array show up in the underlying buffer:

>>> raw = bytearray(raw)
>>> arr = np.frombuffer(raw, dtype='S16')
>>> arr[0] = b"z"*16
>>> arr
array([b'zzzzzzzzzzzzzzzz', b'Good Bye'], dtype='|S16')
>>> raw
bytearray(b'zzzzzzzzzzzzzzzzGood Bye\x00\x00\x00\x00\x00\x00\x00\x00')
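
When the memory arrives as a C pointer rather than as a Python buffer object, the same kind of view can be sketched with ctypes and np.ctypeslib.as_array. This is only an illustration under stated assumptions: c_buf stands in for memory owned by the C library, the slots are zero-padded, and SLOT and N_SLOTS are hypothetical names. The caller has to keep the underlying memory alive for as long as the array is in use.

```python
import ctypes
import numpy as np

SLOT = 16      # bytes per string slot (assumed)
N_SLOTS = 2    # number of slots (assumed)

# Stand-in for memory filled by a C function; ctypes zero-initializes it.
c_buf = (ctypes.c_char * (SLOT * N_SLOTS))()
ctypes.memmove(c_buf, b'hello world', 11)
ctypes.memmove(ctypes.addressof(c_buf) + SLOT, b'Good Bye', 8)

# Treat the memory as a plain byte pointer, as a C API might hand it over.
ptr = ctypes.cast(c_buf, ctypes.POINTER(ctypes.c_ubyte))

# Wrap the pointer without copying, then reinterpret as fixed-width strings.
flat = np.ctypeslib.as_array(ptr, shape=(SLOT * N_SLOTS,))
arr = flat.view(f'S{SLOT}')
print(arr)

# Writes go straight through to the C-side memory.
arr[1] = b'changed'
print(c_buf.raw[SLOT:SLOT + 8])
```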

6 Comments

Let's assume there aren't any rubbish bytes and all unused bytes are \x00. Is it possible to use something like np.ctypeslib.as_array(int_ptr, shape=(int_count,))? The advantage is that we use the memory as it is, without a single copy operation. This works for integers. The tricky part is that I need to let NumPy know that each element is 16 bytes long.
@D.J.Elkind hmmm, I hesitate to suggest this, but I believe using the np.ndarray constructor directly (which it is not meant to be), np.ndarray(shape=(2,), buffer=raw, dtype='S16'), would use the underlying buffer without copying it.
@D.J.Elkind although, actually, np.frombuffer is not making a copy of the underlying data, I believe. Indeed, try to assign to the array created from a bytes object and it complains that you are trying to write to a read-only destination. If you used a bytearray instead, you would see the changes in both!
Why don't you suggest this? Does NumPy's documentation say it is inadvisable, or might it risk leaking memory? In my case, the memory pointed to by byte_arr is prepared by the C program solely for use by NumPy, so it won't be corrupted later by other functions modifying it.
"actually, np.frombuffer is not making a copy of the underlying data" -> oh really, let me try as well