Java 8 UTF-8 encoding issue (java bug?)

Question

There is an inconsistency when creating a String with UTF-8 encoding.

Run this code:

public static void encodingIssue() throws IOException {
    byte[] array = new byte[3];
    array[0] = (byte) -19;
    array[1] = (byte) -69;
    array[2] = (byte) -100;

    String str = new String(array, "UTF-8");
    for (char c : str.toCharArray()) {
        System.out.println((int) c);
    }
}

On Java 1.8.0_20 (and earlier Java 8 versions) the result is 65533 (U+FFFD).

On Java 1.7 and 1.6 we have the correct result, 57052 (U+DEDC).

Have you encountered this error? Is there a workaround for this?

This inconsistency also manifests itself for Shift_JIS, JIS_X0212-1990, x-IBM300, x-IBM834, x-IBM942, x-IBM942C, x-JIS0208, but obviously UTF-8 is the most urgent.

Holger · Accepted Answer · 2014-08-20 12:41:00Z

11

It is a property of the “Modified UTF-8” encoding to store surrogate pairs (or even unpaired chars of that range) like individual characters. It is an error for a decoder that claims to use standard UTF-8 to apply “Modified UTF-8” rules instead. This seems to have been fixed with Java 8.

You can reliably read such data using a method that is specified to use “Modified UTF-8”:

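// Requires: java.io.ByteArrayInputStream, java.io.DataInputStream, java.nio.ByteBuffer
// readUTF() expects a 2-byte length prefix followed by Modified UTF-8 bytes
// (so at most 65535 bytes); it throws IOException.
ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
bb.putShort((short) array.length).put(array);
ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
DataInputStream dis = new DataInputStream(bis);
String str = dis.readUTF();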
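For reference, the snippet above could be wrapped into a self-contained helper and applied to the bytes from the question. This is only a sketch: the class name ModifiedUtf8Demo and the method decodeModifiedUtf8 are made up for illustration, and the explicit length check is an added assumption based on the 2-byte length prefix that readUTF() reads.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

class ModifiedUtf8Demo {

    // Hypothetical helper: decodes bytes as Modified UTF-8 by prepending
    // the 2-byte length prefix that DataInputStream.readUTF() expects.
    static String decodeModifiedUtf8(byte[] data) {
        if (data.length > 0xFFFF) {
            throw new IllegalArgumentException("readUTF() handles at most 65535 bytes");
        }
        ByteBuffer bb = ByteBuffer.allocate(data.length + 2);
        bb.putShort((short) data.length).put(data);
        try (DataInputStream dis =
                 new DataInputStream(new ByteArrayInputStream(bb.array()))) {
            return dis.readUTF();
        } catch (IOException e) {
            // Malformed Modified UTF-8 surfaces as UTFDataFormatException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] array = {(byte) -19, (byte) -69, (byte) -100}; // ED BB 9C
        for (char c : decodeModifiedUtf8(array).toCharArray()) {
            System.out.println((int) c); // 57052 (U+DEDC)
        }
    }
}
```

Wrapping the checked IOException keeps call sites short. Whether that is appropriate depends on how the surrounding legacy code handles errors.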

1 Comment

Stefan Bulzan Over a year ago

Thanks for the response. Indeed, I use legacy code that modifies the UTF-8 standard to work around a Java 1.3 bug: developer.java.sun.com/developer/bugParade/bugs/4251997.html. Your code snippet fixed my issue.

Community · Accepted Answer · 2021-10-07 05:51:50Z

6

The value received in Java 1.6/1.7 is U+DEDC (a low surrogate).
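To see where U+DEDC comes from, here is a minimal sketch of the naive three-byte decoding step (pattern 1110xxxx 10xxxxxx 10xxxxxx) applied to the question's bytes ED BB 9C, with no validity checks at all. The class and method names are illustrative only.

```java
class ThreeByteDecode {

    // Naive decoding of a 3-byte UTF-8 sequence: take 4 payload bits from
    // the lead byte and 6 from each continuation byte. No validation.
    static int decodeThreeBytes(byte b0, byte b1, byte b2) {
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        int value = decodeThreeBytes((byte) -19, (byte) -69, (byte) -100);
        System.out.println(Integer.toHexString(value));             // dedc
        System.out.println(Character.isLowSurrogate((char) value)); // true
    }
}
```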

From RFC 3629:

The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

...text elided...

Implementations of the decoding algorithm above MUST protect against decoding invalid sequences. For instance, a naive implementation may decode the overlong UTF-8 sequence C0 80 into the character U+0000, or the surrogate pair ED A1 8C ED BE B4 into U+233B4. Decoding invalid sequences may have security consequences or cause other problems.

Java 8 decodes this to U+FFFD (REPLACEMENT CHARACTER). The Java 1.6/1.7 behavior looks like a bug that was fixed in Java 8.
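If silent replacement with U+FFFD is not wanted, one option is to decode through a CharsetDecoder configured to report errors, so that invalid input is detected instead of replaced. The following is a sketch, assuming Java 8 or later. StrictUtf8 and decodeStrict are hypothetical names, and returning null on failure is just one possible convention.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8 {

    // Returns the decoded string, or null if the bytes are not valid UTF-8.
    static String decodeStrict(byte[] data) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // e.g. an encoded surrogate such as ED BB 9C
        }
    }

    public static void main(String[] args) {
        byte[] array = {(byte) -19, (byte) -69, (byte) -100};
        System.out.println(decodeStrict(array) == null
                ? "not valid UTF-8" : "valid UTF-8");
    }
}
```

A caller could use the null result to fall back to a Modified UTF-8 decoder for legacy data.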

answered Aug 20, 2014 at 12:35

McDowell

1 Comment

Stefan Bulzan Over a year ago

Thanks a lot for your response, it helped me a lot in understanding the issue. I fixed it by using Modified UTF-8 to decode the bytes. As this is legacy code, I have to keep backward compatibility.

Sean Owen · Accepted Answer · 2014-08-20 12:29:11Z

3

That is a surrogate, right? I'm not a Unicode expert, but I don't think it has meaning by itself. Java 8 changed to support Unicode 6.2. Maybe it's stricter about this. 65533 is the standard 0xFFFD replacement character, which means "not representable". Is there a real case where you need to interpret this as a string? Because it seems like Unicode is saying that doesn't make sense as a character anymore.
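One way to see that a lone surrogate has no standalone meaning in UTF-8 is to try encoding it. In this sketch (names are illustrative), String.getBytes cannot encode the unpaired char and substitutes the encoder's replacement, which in the JDK's UTF-8 encoder is a single '?' byte, so the value does not survive a round trip.

```java
import java.nio.charset.StandardCharsets;

class LoneSurrogateDemo {

    // Encodes a string to UTF-8 with String.getBytes, which replaces
    // anything it cannot encode instead of throwing.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String lone = String.valueOf((char) 0xDEDC); // unpaired low surrogate
        System.out.println(Character.isSurrogate(lone.charAt(0))); // true
        byte[] encoded = encodeUtf8(lone);
        System.out.println(encoded.length + " byte(s), first = " + encoded[0]);
    }
}
```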
