8

I am interfacing with a Java application via Python. I need to be able to construct byte sequences which contain UTF-8 strings. Java uses a modified UTF-8 encoding in DataInputStream.readUTF(), which Python does not support (at least not yet).

Can anybody point me in the right direction to construct Java modified UTF-8 strings in Python?

Update #1: To see a little more about the Java modified UTF-8, check out the readUTF() method from the DataInput interface on line 550 here, or here in the Java SE docs.

Update #2: I am trying to interface with a third-party JBoss web app which is using this modified UTF-8 format to read in strings via POST requests by calling DataInputStream.readUTF() (sorry for any confusion regarding normal Java UTF-8 string operation).

8
  • 1
    What do you mean by "modified UTF-8"? As far as I'm aware Java uses an entirely standard UTF-8 if you ask it to encode to UTF-8. Note that Java's native string format is UTF-16 though. Commented Sep 8, 2009 at 9:41
  • 1
    Hi Jon, I added a link to the readUTF method in the DataInput interface which mentions it a little. I'll try to dig up some more info. Commented Sep 8, 2009 at 9:46
  • 2
    There is some info on Wikipedia: en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 (so, serialization, some JNI and in-class string constants). Commented Sep 8, 2009 at 9:47
  • 1
    I would suggest modifying the Java application to use real UTF-8. Commented Sep 8, 2009 at 9:50
  • Thanks McDowell, I am trying to interface with a JBoss web app which is using this modified utf8 format to read in strings via POST requests. Commented Sep 8, 2009 at 9:50

5 Answers

4

You can ignore Modified UTF-8 Encoding (MUTF-8) and just treat it as UTF-8. On the Python side, you can handle it like this:

  1. Convert the string into normal UTF-8 and store the bytes in a buffer.
  2. Write the buffer length in bytes (not the string length in characters) as a 2-byte big-endian binary value.
  3. Write the whole buffer.

I've done this in PHP and Java didn't complain about my encoding at all (at least in Java 5).

MUTF-8 is mainly used for JNI and other systems with null-terminated strings. The only difference from normal UTF-8 is how U+0000 is encoded: normal UTF-8 uses a 1-byte encoding (0x00), while MUTF-8 uses 2 bytes (0xC0 0x80). First of all, you shouldn't normally have U+0000 (the NUL character) in any Unicode text. Secondly, DataInputStream.readUTF() doesn't enforce the encoding, so it happily accepts either one.
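To make the U+0000 difference concrete, here is a small sketch in Python 3. The helper name `strict_decode_fails` is made up for illustration; the byte values are the ones quoted above.

```python
# Standard UTF-8 encodes U+0000 as a single zero byte.
standard_nul = "\x00".encode("utf-8")

# Modified UTF-8 uses the overlong two-byte form instead.
modified_nul = b"\xc0\x80"


def strict_decode_fails(raw):
    """Return True if Python's strict UTF-8 decoder rejects `raw`.

    Hypothetical helper, only here to show the behaviour.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


print(standard_nul)
print(strict_decode_fails(modified_nul))
```

So going from Python to Java with plain UTF-8 can work for this case, but bytes written by Java may not decode with Python's built-in codec.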

EDIT: The Python code should look like this:

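import struct

def writeUTF(data, text):
    # Encode to standard UTF-8; the prefix counts bytes, not characters.
    utf8 = text.encode('utf-8')
    length = len(utf8)
    # 2-byte unsigned length in network (big-endian) byte order.
    data.append(struct.pack('!H', length))
    # The parameter was renamed from `str` so the builtin str() is not shadowed.
    fmt = '!' + str(length) + 's'
    data.append(struct.pack(fmt, utf8))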

2 Comments

U+0000 isn't the only difference. For code points that would be represented with surrogate pairs in UTF-16, modified UTF-8 encodes each component of the pair as if they were separate UTF-8 code points. This is pretty horrible because it means you have to convert from "modified UTF-8" to UTF-16, and then back in order to encode the correct code point.
I don't think you can ignore it: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 10: invalid start byte
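For anyone who needs both differences raised in these comments handled on the writing side, here is a rough Python 3 sketch. It assumes the layout described in the DataInput docs: a 2-byte big-endian byte count, then each UTF-16 code unit encoded separately in 1, 2 or 3 bytes. The names `to_modified_utf8` and `pack_java_utf` are made up for this example and are not part of any library.

```python
import struct


def to_modified_utf8(text):
    """Encode `text` as modified UTF-8 bytes (illustrative sketch)."""
    out = bytearray()
    # Going through UTF-16 exposes surrogate pairs as two 16-bit units,
    # which is what the modified format encodes one by one.
    utf16 = text.encode("utf-16-be", "surrogatepass")
    for (unit,) in struct.iter_unpack(">H", utf16):
        if 0x0001 <= unit <= 0x007F:
            # Plain ASCII except NUL: one byte.
            out.append(unit)
        elif unit <= 0x07FF:
            # Two bytes; this branch also covers U+0000 -> C0 80.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes, including each half of a surrogate pair.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


def pack_java_utf(text):
    """Prefix the payload with its byte length, as readUTF() expects."""
    payload = to_modified_utf8(text)
    if len(payload) > 0xFFFF:
        raise ValueError("encoded string does not fit a 2-byte length prefix")
    return struct.pack(">H", len(payload)) + payload


print(pack_java_utf("hi"))
```

For text that contains neither U+0000 nor characters outside the Basic Multilingual Plane, the payload should be identical to what `text.encode('utf-8')` produces.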
3

I know this question is very old, but I still want to contribute, since I ran into the same problem and solved it.

I found the implementation of this modified UTF-8 in the OpenJDK sources and translated it to Python. Here is a link to the gist I created.


1

Okay, if you need to read the format of DataInput.readUTF, I suspect you'll just have to convert the (well-documented) format into Python.

It doesn't look like it would be particularly hard to do. After reading the length and then the binary data itself, I suggest you use a first pass to work out how many Unicode characters will be in the output, then construct a string accordingly in a second pass. Without knowing Python I don't know the ins and outs of how to efficiently construct a string, but given the linked specification I can't imagine it would be very hard. You might want to look at the source for the existing UTF-8 decoder as a starting point.
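A rough Python 3 sketch of that reading direction follows. It is an assumption-laden illustration rather than a port of the Java source: the names `from_modified_utf8`, `read_java_utf` and `_continuation` are made up, and instead of two passes it collects UTF-16 code units in a list and lets Python's `utf-16-be` codec join surrogate pairs.

```python
import io
import struct


def _continuation(raw, i):
    """Return the 6 payload bits of the continuation byte at index i."""
    if i >= len(raw) or (raw[i] & 0xC0) != 0x80:
        raise ValueError("malformed input around byte %d" % i)
    return raw[i] & 0x3F


def from_modified_utf8(raw):
    """Decode modified UTF-8 bytes into a str (illustrative sketch)."""
    units = []  # UTF-16 code units, one per encoded group
    i = 0
    while i < len(raw):
        b0 = raw[i]
        if b0 < 0x80:
            units.append(b0)
            i += 1
        elif (b0 & 0xE0) == 0xC0:
            # Two-byte group; C0 80 decodes to U+0000.
            units.append(((b0 & 0x1F) << 6) | _continuation(raw, i + 1))
            i += 2
        elif (b0 & 0xF0) == 0xE0:
            # Three-byte group, possibly one half of a surrogate pair.
            units.append(((b0 & 0x0F) << 12)
                         | (_continuation(raw, i + 1) << 6)
                         | _continuation(raw, i + 2))
            i += 3
        else:
            # Four-byte sequences are not part of the modified format.
            raise ValueError("malformed input around byte %d" % i)
    utf16 = struct.pack(">%dH" % len(units), *units)
    return utf16.decode("utf-16-be")


def read_java_utf(stream):
    """Read a 2-byte length prefix, then decode that many bytes."""
    (length,) = struct.unpack(">H", stream.read(2))
    return from_modified_utf8(stream.read(length))


sample = io.BytesIO(b"\x00\x03a\xc0\x80")
print(repr(read_java_utf(sample)))
```

Building the list of code units first keeps the byte-level parsing separate from surrogate handling; an unpaired surrogate in the input would make the final decode raise `UnicodeDecodeError`.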


1

There's a Python package that handles both reading and writing MUTF-8 strings, with an optional C extension: https://github.com/TkTech/mutf8

from mutf8 import encode_modified_utf8, decode_modified_utf8

unicode = decode_modified_utf8(byte_like_object)
bytes_ = encode_modified_utf8(unicode)


0

Maybe this can help you, although it looks like it's the reverse of what you're doing:

Connecting a Java applet to a python SocketServer
