
I have some files which contain a bunch of different kinds of binary data, and I'm writing a module to deal with these files.

Among other things, they contain UTF-8 encoded strings in the following format: a 2-byte big-endian stringLength (which I parse using struct.unpack()) and then the string itself. Since it's UTF-8, the length in bytes of the string may be greater than stringLength, and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
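To make the failure mode concrete, here is a minimal sketch in Python 3 syntax (the record layout and variable names are illustrative, assuming the length field counts characters):

```python
import io
import struct

s = 'caf\u00e9'                      # 4 characters, but 5 bytes in UTF-8
payload = s.encode('utf-8')
record = struct.pack('>H', len(s)) + payload  # 2-byte big-endian length, then string

f = io.BytesIO(record)
(string_length,) = struct.unpack('>H', f.read(2))
raw = f.read(string_length)          # reads string_length *bytes*: one byte short
print(string_length, len(payload))   # 4 5
print(raw == payload)                # False: the last byte of 'é' is left unread
```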

How do I read n UTF-8 characters (distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or make assumptions that I cannot make.

  • Are you certain that stringLength is characters and not bytes? Commented Mar 4, 2013 at 10:44
  • wow, that'd be a really terrible format. Do you have the data already read into a string or list of some sort? UTF-8 bytes can be inspected easily enough to determine how many bytes follow to make a character, but you need to process these character-by-decoded-character. Commented Mar 4, 2013 at 10:48
  • @GrahamBorland 100%? No, I have yet to find a file that actually uses multibyte characters, but it is my interpretation of the specification that this is the case. Commented Mar 4, 2013 at 11:00
  • @MartijnPieters Okay, how do I do that in Python? Is there a convenient module I can use? Commented Mar 4, 2013 at 11:01

2 Answers


Given a file object, and a number of characters, you can use:

# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)

def readUTF8(f, count):
    """Read `count` UTF-8 bytes from file `f`, return as unicode"""
    # Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return (''.join(res)).decode('utf8')

Result of a test:

>>> from StringIO import StringIO
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'

In Python 3, it is of course much, much easier to just wrap the file object in a io.TextIOWrapper() object and leave decoding to the native and efficient Python UTF-8 implementation.
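For example, a minimal Python 3 sketch, using io.BytesIO as a stand-in for the real file; note that TextIOWrapper reads ahead into its own buffer, so the position of the underlying binary stream is no longer reliable for subsequent raw reads:

```python
import io

data = 'This is a test containing Unicode data: \ua000'.encode('utf-8')
raw = io.BytesIO(data)

# newline='' disables universal-newline translation, which would otherwise
# rewrite any '\r' characters embedded in the string data
text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
result = text.read(41)               # reads 41 *characters*, not bytes
print(result)
```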


16 Comments

This is exactly what I'm looking for. Accepted and upvoted. Is it enough that I link to your StackOverflow profile when attributing that section to you?
@Surma: Sure; all content of this site is licensed as CC-wiki (see bottom right) but the readcount 'function' was adapted from a simple C macro, so I was reusing stuff too. :-) All in all, this is simple stuff once you understand the underlying byte formats.
@Surma: There are also more pythonic ways to determine the readcount value; they may even be faster than what I used here. This one uses between 2 and 4 (simple) tests per byte.
@Surma: Updated: moved to using a table instead, so that your inner loop only has to do one lookup per lead byte.
@Sovetnikov: just use bytes(res).decode(). Or, don't handle raw UTF-8 bytes yourself and use the standard library io.TextIOWrapper() object.

One character in UTF-8 can be 1, 2, 3, or 4 bytes.

If you have to read your file byte by byte, you have to follow the UTF-8 encoding rules: http://en.wikipedia.org/wiki/UTF-8
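A minimal sketch of those rules (the helper name is my own; it assumes a valid lead byte):

```python
def utf8_seq_len(lead):
    """Total bytes in the UTF-8 sequence starting with lead byte `lead`."""
    if lead < 0x80:            # 0xxxxxxx: ASCII, single byte
        return 1
    if lead >> 5 == 0b110:     # 110xxxxx: start of a 2-byte sequence
        return 2
    if lead >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte sequence
        return 3
    if lead >> 3 == 0b11110:   # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError('not a valid UTF-8 lead byte: %#x' % lead)

print(utf8_seq_len('\u00e9'.encode('utf-8')[0]))  # é is a 2-byte sequence: 2
```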

Most of the time, you can just set the encoding to UTF-8 and read the input stream as text.

Then you do not need to care how many bytes you have read.

1 Comment

I googled setting input stream encoding and got the docs on the codecs module. If I understand this correctly, I could do something like this: `strLen = struct.unpack('>h', f.read(2))`, then `utfStream = codecs.open(f, 'r', 'utf-8')`, then `string = utfStream.read(strLen)`. One question though: will this advance the pointer in my file descriptor, so that subsequent read()'s on f will return bytes after the string I just read?
