18

There is an inconsistency when creating a String with UTF-8 encoding.

Run this code:

public static void encodingIssue() throws IOException {
    byte[] array = new byte[3];
    array[0] = (byte) -19;
    array[1] = (byte) -69;
    array[2] = (byte) -100;

    String str = new String(array, "UTF-8");
    for (char c : str.toCharArray()) {
        System.out.println((int) c);
    }
}

On Java 1.8.0_20 (and earlier Java 8 versions) we have the result

 65533

On Java 1.7 and 1.6 we have the correct result:

 57052

Have you encountered this error? Is there a workaround for this?

This inconsistency also shows up for Shift_JIS, JIS_X0212-1990, x-IBM300, x-IBM834, x-IBM942, x-IBM942C and x-JIS0208, but obviously UTF-8 is the most urgent.

3 Answers

11

It is a property of the Modified UTF-8 encoding that each surrogate (whether part of a pair or unpaired) is stored as if it were an individual character. A decoder that claims to implement standard UTF-8 but actually accepts “Modified UTF-8” is in error. This seems to have been fixed in Java 8.

You can reliably read such data using a method that is specified to use “Modified UTF-8”:

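// Needs java.nio.ByteBuffer, java.io.ByteArrayInputStream and java.io.DataInputStream.
// readUTF() expects a 2-byte length prefix in front of the encoded bytes.
ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
bb.putShort((short) array.length).put(array);
ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
DataInputStream dis = new DataInputStream(bis);
// readUTF() is specified to decode Modified UTF-8; it may throw IOException.
String str = dis.readUTF();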
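
For reference, the snippet above can be wrapped into a self-contained helper. This is only a minimal sketch: the class name ModifiedUtf8 and the method decode are made up for illustration, and it assumes the input is at most 65535 bytes long, because readUTF() reads an unsigned 16-bit length prefix.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

// Hypothetical helper for illustration; not part of the JDK.
class ModifiedUtf8 {

    // Decodes the given bytes as Modified UTF-8 via DataInputStream.readUTF().
    static String decode(byte[] bytes) throws IOException {
        if (bytes.length > 0xFFFF) {
            throw new IllegalArgumentException("readUTF() handles at most 65535 bytes");
        }
        // Prepend the big-endian 2-byte length that readUTF() expects.
        ByteBuffer bb = ByteBuffer.allocate(bytes.length + 2);
        bb.putShort((short) bytes.length).put(bytes);
        try (DataInputStream dis =
                 new DataInputStream(new ByteArrayInputStream(bb.array()))) {
            return dis.readUTF();
        }
    }

    public static void main(String[] args) throws IOException {
        // The three bytes from the question: 0xED 0xBB 0x9C.
        byte[] array = {(byte) -19, (byte) -69, (byte) -100};
        for (char c : decode(array).toCharArray()) {
            System.out.println((int) c);
        }
    }
}
```

With the bytes from the question this should print 57052 again, because readUTF() does not reject encoded surrogates.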

1 Comment

Thanks for the response. Indeed, I am working with legacy code that deviates from the UTF-8 standard to work around a Java 1.3 bug: developer.java.sun.com/developer/bugParade/bugs/4251997.html. Your code snippet fixed my issue.

6

The value received in Java 1.6/1.7 is U+DEDC (a low surrogate).
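
To see where U+DEDC comes from, the three bytes from the question (0xED 0xBB 0x9C) can be pushed through the generic three-byte UTF-8 bit layout 1110xxxx 10xxxxxx 10xxxxxx by hand. The sketch below does only that arithmetic, with no validity checks at all; the class and method names are made up for illustration.

```java
// Hypothetical helper for illustration; not part of the JDK.
class ThreeByteDecode {

    // Combines the payload bits of a three-byte sequence
    // (1110xxxx 10xxxxxx 10xxxxxx) without validating anything.
    static int combine(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        int value = combine((byte) -19, (byte) -69, (byte) -100);
        // Prints: U+DEDC = 57052
        System.out.printf("U+%04X = %d%n", value, value);
        // Prints: low surrogate? true
        System.out.println("low surrogate? " + Character.isLowSurrogate((char) value));
    }
}
```

A decoder that skips the surrogate range check ends up with exactly this value, which matches the Java 1.6/1.7 output.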

From RFC 3629:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

...text elided...

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.

Java 8 decodes this to U+FFFD (REPLACEMENT CHARACTER). This looks like a bug that was fixed in Java 8.
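
If silently receiving U+FFFD is not acceptable, one option is to ask the decoder to report malformed input instead of replacing it. The following is a minimal sketch using CharsetDecoder with CodingErrorAction.REPORT. The class and method names are illustrative, and the behaviour shown assumes Java 8 or later, where the UTF-8 decoder rejects encoded surrogates.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the JDK.
class StrictUtf8 {

    // Decodes standard UTF-8 and throws instead of substituting U+FFFD.
    static String decode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Returns true only if the bytes are accepted by the strict decoder.
    static boolean isValid(byte[] bytes) {
        try {
            decode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The three bytes from the question: an encoded low surrogate.
        byte[] array = {(byte) -19, (byte) -69, (byte) -100};
        System.out.println("valid UTF-8? " + isValid(array));
    }
}
```

This makes the invalid input visible at the point of decoding, so a caller can choose to fall back to a Modified UTF-8 decoder for legacy data.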

1 Comment

Thanks a lot for your response, it helped me a lot in understanding the issue. I fixed it by using Modified UTF-8 to decode the bytes. As this is legacy code, I have to keep backward compatibility.

3

That is a surrogate, right? I'm not a Unicode expert, but I don't think it has meaning by itself. Java 8 changed to support Unicode 6.2. Maybe it's stricter about this. 65533 is the standard replacement character U+FFFD, which means "not representable". Is there a real case where you need to interpret this as a string? It seems like Unicode is saying that it doesn't make sense as a character anymore.
