I have a numpy bytes array containing characters, followed by b'', followed by other characters (including weird characters which raise Unicode errors when decoding):

bytes = numpy.array([b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b'', b'\x80', b'\x04', b'\x08', b'\x06'])

I want to get everything before the first b''.

Currently my code is:

txt = []
for c in bytes:
    if c != b'':
        txt.append(c.decode('utf-8'))
    else:
        break
txt = ''.join(txt)

I suppose there is a more efficient and Pythonic way to do that.

1 Answer

I like your way: it is explicit, the for loop is understandable by all, and it isn't all that slow compared to other approaches.

Some suggestions I'd make: change your condition from if c != b'' to if c, since a non-empty bytes object is truthy, and don't name your list bytes, because that masks the built-in! Name it bt or something similar :-)
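
With both suggestions applied, the loop might look like the sketch below. The function name decode_until_empty is made up for illustration, and the sample data is shortened from the question:

```python
def decode_until_empty(bt):
    """Decode and join the items of bt that come before the first empty one."""
    txt = []
    for c in bt:
        if c:  # a non-empty bytes object is truthy
            txt.append(c.decode('utf-8'))
        else:  # first b'' reached: stop before the undecodable bytes
            break
    return ''.join(txt)


bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe']
print(decode_until_empty(bt))  # foo
```

Because the loop breaks at the first b'', the b'\xfe' item is never decoded and no UnicodeDecodeError is raised.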

Other options include itertools.takewhile which will grab elements from an iterable as long as a predicate holds; your operation would look like:

"".join(s.decode('utf-8') for s in takewhile(bool, bt))

This is slightly slower but more compact; if you're a one-liner lover, this might appeal to you.
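
For completeness, here is the takewhile version as a self-contained snippet, including the import it needs. The sample data is shortened from the question and bt is assumed to be a plain list here:

```python
from itertools import takewhile

bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe']

# bool(b'') is False, so takewhile stops at the first empty item
# and never reaches the undecodable b'\xfe'.
result = "".join(s.decode('utf-8') for s in takewhile(bool, bt))
print(result)  # foo
```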

Slightly faster and also compact is using index along with a slice:

"".join(b.decode('utf-8') for b in bt[:bt.index(b'')])

While compact, it also suffers in readability.
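
One thing to keep in mind with index: list.index raises ValueError when there is no b'' in the list at all. A rough sketch that falls back to the whole list in that case (slice_before_empty is a hypothetical name, and bt is assumed to be a plain list):

```python
def slice_before_empty(bt):
    """Decode and join everything before the first b'' in the list bt."""
    try:
        end = bt.index(b'')
    except ValueError:
        # No empty item found: use the whole list.
        end = len(bt)
    return "".join(b.decode('utf-8') for b in bt[:end])


bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe']
print(slice_before_empty(bt))  # foo
```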

In short, I'd go with the for loop, since readability counts as very Pythonic in my eyes.

3 Comments

Thanks for the advice! Oh, in fact the byte array was a numpy array. I like your second solution, but I benchmarked these 3 solutions (with ba[:np.where(ba == b'')[0][0]] instead of ba[:ba.index(b'')]) and it appears that the for loop solution is faster, so I chose it.
@user2914540 oh I was unaware that it was a numpy array, maybe add the numpy tag and specify that bytes is a numpy array? There might be more efficient ways to do this in numpy.
Done. Sorry, this array comes from an external library (netcdf4py) and I discovered it was a numpy array by trying to do ba.index().
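
Following up on the numpy point in the comments, one possible numpy-side variant is sketched below. It assumes the array has dtype 'S1' (one byte per element), which may not match what the external library returns, and it has not been benchmarked against the for loop:

```python
import numpy as np

# Assumption: a one-byte-per-element array (dtype 'S1'); data shortened from the question.
ba = np.array([b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe'], dtype='S1')

# ba == b'' compares elementwise; flatnonzero returns the positions of the empty items.
empty = np.flatnonzero(ba == b'')
end = empty[0] if empty.size else len(ba)

# Concatenate the raw bytes of the leading slice, then decode once.
text = ba[:end].tobytes().decode('utf-8')
print(text)  # foo
```

Decoding once after joining the raw bytes also copes with multi-byte UTF-8 characters that are spread over several array elements, which per-element decoding would not.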
