I have some files which contains a bunch of different kinds of binary data and I'm writing a module to deal with these files.
Amongst other, it contains UTF-8 encoded strings in the following format: 2 bytes big endian stringLength (which I parse using struct.unpack()) and then the string. Since it's UTF-8, the length in bytes of the string may be greater than stringLength and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
How do I read n UTF-8 characters (distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or makes assumptions that I cannot make.