
I have some files which contain a bunch of different kinds of binary data, and I'm writing a module to deal with these files.

Among other things, they contain UTF-8 encoded strings in the following format: a 2-byte big-endian stringLength (which I parse using struct.unpack()) and then the string itself. Since it's UTF-8, the length in bytes of the string may be greater than stringLength, and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
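To make the failure mode concrete, here is a minimal sketch in Python 3 syntax (the record layout and variable names are illustrative, assuming the length field counts characters):

```python
import io
import struct

s = 'caf\u00e9'                      # 4 characters, but 5 bytes in UTF-8
payload = s.encode('utf-8')
record = struct.pack('>H', len(s)) + payload  # 2-byte big-endian length, then string

f = io.BytesIO(record)
(string_length,) = struct.unpack('>H', f.read(2))
raw = f.read(string_length)          # reads string_length *bytes*: one byte short
print(string_length, len(payload))   # 4 5
print(raw == payload)                # False: the last byte of 'é' is left unread
```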

How do I read n UTF-8 characters (distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or make assumptions that I cannot make.

  • Are you certain that stringLength is characters and not bytes? Commented Mar 4, 2013 at 10:44
  • wow, that'd be a really terrible format. Do you have the data already read into a string or list of some sort? UTF-8 bytes can be inspected easily enough to determine how many bytes follow to make a character, but you need to process these character-by-decoded-character. Commented Mar 4, 2013 at 10:48
  • @GrahamBorland 100%? No, I have yet to find a file that actually uses multibyte characters, but it is my interpretation of the specification that this is the case. Commented Mar 4, 2013 at 11:00
  • @MartijnPieters Okay, how do I do that in Python? Is there a convenient module I can use? Commented Mar 4, 2013 at 11:01

2 Answers


Given a file object, and a number of characters, you can use:

# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)

def readUTF8(f, count):
    """Read `count` UTF-8 bytes from file `f`, return as unicode"""
    # Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return (''.join(res)).decode('utf8')

Result of a test:

>>> from StringIO import StringIO
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'

In Python 3, it is of course much, much easier to just wrap the file object in a io.TextIOWrapper() object and leave decoding to the native and efficient Python UTF-8 implementation.
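For example, a minimal Python 3 sketch, using io.BytesIO as a stand-in for the real file; note that TextIOWrapper reads ahead into its own buffer, so the position of the underlying binary stream is no longer reliable for subsequent raw reads:

```python
import io

data = 'This is a test containing Unicode data: \ua000'.encode('utf-8')
raw = io.BytesIO(data)

# newline='' disables universal-newline translation, which would otherwise
# rewrite any '\r' characters embedded in the string data
text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
result = text.read(41)               # reads 41 *characters*, not bytes
print(result)
```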


16 Comments

This is exactly what I'm looking for. Accepted and upvoted. Is it enough that I link to your StackOverflow profile when attributing that section to you?
@Surma: Sure; all content of this site is licensed as CC-wiki (see bottom right) but the readcount 'function' was adapted from a simple C macro, so I was reusing stuff too. :-) All in all, this is simple stuff once you understand the underlying byte formats.
@Surma: There are also more pythonic ways to determine the readcount value; they may even be faster than what I used here. This one uses between 2 and 4 (simple) tests per byte.
@Surma: Updated: moved to using a table instead, so that your inner loop only has to do one lookup per lead byte.
@Sovetnikov: just use bytes(res).decode(). Or, don't handle raw UTF-8 bytes yourself and use the standard library io.TextIOWrapper() object.

One character in UTF-8 can be 1, 2, 3, or 4 bytes.

If you have to read your file byte by byte, you have to follow the UTF-8 encoding rules: http://en.wikipedia.org/wiki/UTF-8
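A minimal sketch of those rules (the helper name is my own; it assumes a valid lead byte):

```python
def utf8_seq_len(lead):
    """Total bytes in the UTF-8 sequence starting with lead byte `lead`."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, single byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError('not a valid UTF-8 lead byte: %#x' % lead)

print(utf8_seq_len('\u00e9'.encode('utf-8')[0]))  # é is a 2-byte sequence: 2
```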

Most of the time, you can just set the encoding to UTF-8 and read the input stream as text.

Then you do not need to care how many bytes you have read.

1 Comment

I googled setting input stream encoding and got the docs on the codecs module. If I understand this correctly, I could do something like this: `strLen = struct.unpack('>h', f.read(2))`, then `utfStream = codecs.open(f, 'r', 'utf-8')`, then `string = utfStream.read(strLen)`. One question though: will this advance the pointer in my file descriptor, so that subsequent read()'s on f will return bytes after the string I just read?
