Java modified UTF-8 strings in Python

Question

I am interfacing with a Java application via Python. I need to be able to construct byte sequences which contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF(), which Python does not support (yet, at least).

Can anybody point me in the right direction to construct Java modified UTF-8 strings in Python?

Update #1: To see a little more about the Java modified UTF-8, check out the readUTF() method from the DataInput interface on line 550 here, or here in the Java SE docs.

Update #2: I am trying to interface with a third-party JBoss web app which is using this modified UTF-8 format to read in strings via POST requests by calling DataInputStream.readUTF() (sorry for any confusion regarding normal Java UTF-8 string operation).

What do you mean by "modified UTF-8"? As far as I'm aware Java uses an entirely standard UTF-8 if you ask it to encode to UTF-8. Note that Java's native string format is UTF-16 though.
– Jon Skeet, Commented Sep 8, 2009 at 9:41
Hi Jon, I added a link to the readUTF method in the DataInput interface which mentions it a little. I'll try to dig up some more info.
– QAZ, Commented Sep 8, 2009 at 9:46
There is some info on Wikipedia: en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 (so, serialization, some JNI and in-class string constants).
– McDowell, Commented Sep 8, 2009 at 9:47
I would suggest modifying the Java application to use real UTF-8.
– Tom Hawtin - tackline, Commented Sep 8, 2009 at 9:50
Thanks McDowell, I am trying to interface with a JBoss web app which is using this modified utf8 format to read in strings via POST requests.
– QAZ, Commented Sep 8, 2009 at 9:50

ZZ Coder · Accepted Answer · 2009-09-08 12:17:39Z

4

You can ignore Modified UTF-8 Encoding (MUTF-8) and just treat it as UTF-8. On the Python side, you can handle it like this:

Convert the string into normal UTF-8 and store the bytes in a buffer.
Write the buffer length in bytes (not the string length) as a 2-byte big-endian integer.
Write the whole buffer.

I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).

MUTF-8 is mainly used for JNI and other systems with null-terminated strings. The only difference from normal UTF-8 is how U+0000 is encoded. Normal UTF-8 uses a 1-byte encoding (0x00) and MUTF-8 uses 2 bytes (0xC0 0x80). First of all, you shouldn't have U+0000 (an invalid codepoint) in any Unicode text. Secondly, DataInputStream.readUTF() doesn't enforce the encoding, so it happily accepts either one.

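EDIT: The Python code should look like this:

import struct

def writeUTF(data, text):
    # Encode as standard UTF-8; `data` is a list collecting byte chunks
    utf8 = text.encode('utf-8')
    length = len(utf8)
    # 2-byte big-endian byte length ('!H' fails above 65535 bytes)
    data.append(struct.pack('!H', length))
    # Payload bytes packed as a fixed-length string
    data.append(struct.pack('!%ds' % length, utf8))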

edited Sep 8, 2009 at 12:17

answered Sep 8, 2009 at 11:55

ZZ Coder


2 Comments

Cogwheel Over a year ago

U+0000 isn't the only difference. For code points that would be represented with surrogate pairs in UTF-16, modified UTF-8 encodes each component of the pair as if they were separate UTF-8 code points. This is pretty horrible because it means you have to convert from "modified UTF-8" to UTF-16, and then back in order to encode the correct code point.
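
The two differences described above (U+0000 as 0xC0 0x80, and each UTF-16 surrogate encoded separately as three bytes) can be sketched in Python roughly as follows. This is a minimal sketch, not a reference implementation: the names `mutf8_encode` and `write_utf` are hypothetical, and it assumes the receiving side expects the 2-byte big-endian length prefix followed by the payload.

```python
import struct


def mutf8_encode(text: str) -> bytes:
    """Encode text as modified UTF-8 (hypothetical helper, a sketch only)."""
    out = bytearray()
    # Java strings are UTF-16 code units, so go through UTF-16 first.
    # 'surrogatepass' lets lone surrogates through instead of raising.
    utf16 = text.encode('utf-16-be', 'surrogatepass')
    for (unit,) in struct.iter_unpack('>H', utf16):
        if 0x0001 <= unit <= 0x007F:
            # Plain ASCII except NUL: one byte
            out.append(unit)
        elif unit <= 0x07FF:
            # Two bytes; this branch also turns U+0000 into C0 80
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes; surrogate halves land here one at a time
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def write_utf(text: str) -> bytes:
    """Build a length-prefixed record: 2-byte big-endian length, then payload."""
    payload = mutf8_encode(text)
    if len(payload) > 0xFFFF:
        raise ValueError('encoded string longer than 65535 bytes')
    return struct.pack('>H', len(payload)) + payload
```

For text that contains neither U+0000 nor characters outside the Basic Multilingual Plane, the output is byte-for-byte the same as standard UTF-8, which is presumably why the simpler approach in the answer works for many inputs.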

Chris Stryczynski Over a year ago

I don't think you can ignore it: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 10: invalid start byte
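
The error quoted above can be reproduced with a short check. This is only an illustration; the helper name `is_strict_utf8` is hypothetical, and the sample byte strings are made up for the demonstration.

```python
def is_strict_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False


# 'a', then U+0000 in its modified two-byte form, then 'b'
with_nul = b'a\xc0\x80b'
# A supplementary character written as two 3-byte surrogate sequences
with_surrogates = b'\xed\xa0\xbd\xed\xb8\x80'

print(is_strict_utf8(b'plain ascii'))
print(is_strict_utf8(with_nul))
print(is_strict_utf8(with_surrogates))
```

Python's built-in codec rejects both the 0xC0 0x80 form and the encoded surrogates, so data produced by the Java side may need a dedicated decoder when it contains either.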

bam · Accepted Answer · 2017-12-30 19:05:16Z

3

I know this question is very old, but I still want to contribute, since I ran into the same problem and solved it.

I found the implementation of this modified UTF-8 in the OpenJDK sources and translated it to Python. Here is a link to the gist I created.

answered Dec 30, 2017 at 19:05

bam


Jon Skeet · Accepted Answer · 2009-09-08 09:54:37Z

1

Okay, if you need to read the format of DataInput.readUTF, I suspect you'll just have to convert the (well-documented) format into Python.

It doesn't look like it would be particularly hard to do. After reading the length and then the binary data itself, I suggest you use a first pass to work out how many Unicode characters will be in the output, then construct a string accordingly in a second pass. Without knowing Python I don't know the ins and outs of how to efficiently construct a string, but given the linked specification I can't imagine it would be very hard. You might want to look at the source for the existing UTF-8 decoder as a starting point.
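
A rough Python sketch of the reading side might look like the following. It is an assumption-laden illustration, not the approach from the linked specification verbatim: instead of two passes it collects UTF-16 code units in a single pass and lets Python's UTF-16 decoder join surrogate pairs. The names `mutf8_decode` and `read_utf` are hypothetical.

```python
import struct


def mutf8_decode(payload: bytes) -> str:
    """Decode a modified UTF-8 payload (hypothetical helper, a sketch only)."""
    units = []  # UTF-16 code units, which is what Java strings hold
    i, n = 0, len(payload)
    while i < n:
        b0 = payload[i]
        if b0 & 0x80 == 0x00:      # 0xxxxxxx: one byte
            size, unit = 1, b0
        elif b0 & 0xE0 == 0xC0:    # 110xxxxx: two bytes (covers C0 80)
            size, unit = 2, b0 & 0x1F
        elif b0 & 0xF0 == 0xE0:    # 1110xxxx: three bytes (covers surrogates)
            size, unit = 3, b0 & 0x0F
        else:
            raise ValueError('invalid lead byte 0x%02x at offset %d' % (b0, i))
        if i + size > n:
            raise ValueError('truncated sequence at offset %d' % i)
        for b in payload[i + 1:i + size]:
            if b & 0xC0 != 0x80:
                raise ValueError('invalid continuation byte near offset %d' % i)
            unit = (unit << 6) | (b & 0x3F)
        units.append(unit)
        i += size
    # Re-pack as big-endian UTF-16; the decoder combines valid surrogate
    # pairs, and 'surrogatepass' tolerates unpaired ones.
    utf16 = struct.pack('>%dH' % len(units), *units)
    return utf16.decode('utf-16-be', 'surrogatepass')


def read_utf(data: bytes, offset: int = 0) -> str:
    """Read one record: 2-byte big-endian length, then that many bytes."""
    (length,) = struct.unpack_from('>H', data, offset)
    payload = data[offset + 2:offset + 2 + length]
    if len(payload) != length:
        raise ValueError('record is shorter than its declared length')
    return mutf8_decode(payload)
```

Going through UTF-16 keeps the surrogate handling out of the hand-written loop, at the cost of one extra intermediate buffer.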

answered Sep 8, 2009 at 9:54

Jon Skeet


Epoc · Accepted Answer · 2024-08-27 09:35:44Z

1

There's a Python package that handles both reading and writing MUTF-8 strings, with an optional C extension: https://github.com/TkTech/mutf8

from mutf8 import encode_modified_utf8, decode_modified_utf8

unicode = decode_modified_utf8(byte_like_object)
bytes_ = encode_modified_utf8(unicode)

edited Aug 27, 2024 at 9:35

answered Sep 8, 2021 at 8:47

Epoc


Olivier 'Ölbaum' Scherler · Accepted Answer · 2009-09-08 09:58:45Z

0

Maybe this can help you, although it looks like it's the reverse of what you're doing:

Connecting a Java applet to a python SocketServer

answered Sep 8, 2009 at 9:58

Olivier 'Ölbaum' Scherler
