
I made the following "simulation":

byte[] b = new byte[256];

for (int i = 0; i < 256; i++) {
    b[i] = (byte) (i - 128);
}
byte[] transformed = new String(b, "cp1251").getBytes("cp1251");

for (int i = 0; i < b.length; i++) {
    if (b[i] != transformed[i]) {
        System.out.println("Wrong : " + i);
    }
}

For cp1251 this outputs only one wrong byte, at position 25.
For KOI8-R, all bytes survive the round trip.
For cp1252 there are 4 or 5 differences.

What is the reason for this and how can this be overcome?

I know it is wrong to represent byte arrays as strings in whatever encoding, but it is a requirement of the protocol of a payment provider, so I don't have a choice.

Update: representing it in ISO-8859-1 works, so I'll use it for the byte[] part and cp1251 for the textual part; the question remains only out of curiosity.
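The update can be checked with a small generalization of the simulation above; a minimal sketch (class and method names are my own):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTrip {

    // Counts byte values that do not survive a decode/encode round trip
    // through the given charset.
    static int countMismatches(Charset cs) {
        byte[] b = new byte[256];
        for (int i = 0; i < 256; i++) {
            b[i] = (byte) (i - 128);
        }
        byte[] transformed = new String(b, cs).getBytes(cs);
        int wrong = 0;
        for (int i = 0; i < b.length; i++) {
            if (b[i] != transformed[i]) {
                wrong++;
            }
        }
        return wrong;
    }

    public static void main(String[] args) {
        // ISO-8859-1 maps every byte 0x00..0xFF to U+0000..U+00FF,
        // so the round trip is lossless.
        System.out.println(countMismatches(StandardCharsets.ISO_8859_1)); // 0
    }
}
```

ISO-8859-1 is special in that its 256 code positions are all defined and map one-to-one onto the first 256 Unicode code points, which is exactly why it is safe for smuggling arbitrary bytes through a String.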

5 Answers


Some of the "bytes" are not mapped in the target character set: during decoding they are replaced with a replacement character. When you convert back, that character is normally encoded as byte value 63, a '?', which isn't what it was before.
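A minimal illustration of that effect, using US-ASCII (which is guaranteed to be available on every JVM) rather than cp1251; the names are my own:

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {

    // Decodes one byte and re-encodes it, returning the byte that comes back.
    static byte roundTrip(byte b) {
        String decoded = new String(new byte[] { b }, StandardCharsets.US_ASCII);
        return decoded.getBytes(StandardCharsets.US_ASCII)[0];
    }

    public static void main(String[] args) {
        System.out.println(roundTrip((byte) 0x41)); // 65 - 'A' survives
        // 0x80 is not valid US-ASCII: decoding substitutes a replacement
        // character, and re-encoding that yields '?' (byte value 63).
        System.out.println(roundTrip((byte) 0x80)); // 63
    }
}
```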


1 Comment

Awesome. I was actually looking for the answer in .NET but they are both similar enough in behaviour that I gleaned it from this. Thanks.

What is the reason for this

The reason is that character encodings are not necessarily bijective, and there is no good reason to expect them to be. Not all bytes or byte sequences are legal in all encodings, and illegal sequences are usually decoded to some sort of placeholder character such as '?' or U+FFFD, which of course does not produce the original bytes when re-encoded.

Additionally, some encodings may map several different legal byte sequences to the same string.



It appears that both cp1251 and cp1252 have byte values that do not correspond to defined characters; i.e. they are "unmappable".

The javadoc for String(byte[], String) says this:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.

Other constructors say this:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string.

If you see this kind of thing happening in practice it indicates that either you are using the wrong character set, or you've been given some bad data. Either way, it is probably not a good idea to carry on as if there was no problem.

I've been trying to figure out whether there is a way to get a CharsetDecoder to "preserve" unmappable characters, and I don't think it is possible unless you are willing to implement a custom decoder/encoder pair. But I've also concluded that it does not make sense to even try. It is (theoretically) wrong to map those unmappable characters to real Unicode code points, and if you did, how would your application handle them?
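If silent substitution is unacceptable, a CharsetDecoder can at least be told to fail loudly instead, via CodingErrorAction.REPORT. A sketch (using US-ASCII for illustration; the helper name is my own):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {

    // Returns true if every byte decodes cleanly; false if the decoder
    // reports a malformed or unmappable sequence instead of silently
    // substituting a replacement character.
    static boolean decodesCleanly(byte[] bytes, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(decodesCleanly(new byte[] { 0x41 },
                StandardCharsets.US_ASCII));        // true
        System.out.println(decodesCleanly(new byte[] { (byte) 0x80 },
                StandardCharsets.US_ASCII));        // false
    }
}
```

This doesn't preserve the bad bytes, but it does turn "garbage in, slightly different garbage out" into an explicit error you can handle.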



Actually there shall be exactly one difference: the byte value 0x98 is converted to a char of value 0xFFFD; that's the "Unicode replacement character", used for untranslatable bytes. When converted back, you get a question mark (value 63).

In CP1251, the code 0x98 is not assigned to any character, which is why Java deems it "untranslatable".
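This can be checked directly, assuming the windows-1251 charset is available in your JRE (it is in standard JDK builds); the helper name is my own:

```java
import java.nio.charset.Charset;

public class Cp1251Gap {

    // Decodes a single byte as windows-1251 and returns the resulting char.
    static char decodeOne(byte b) {
        return new String(new byte[] { b }, Charset.forName("windows-1251")).charAt(0);
    }

    public static void main(String[] args) {
        // 0x98 is unassigned in windows-1251, so it decodes to U+FFFD,
        // while its neighbour 0x99 decodes to U+2122, the trade mark sign.
        System.out.println(Integer.toHexString(decodeOne((byte) 0x98))); // fffd
        System.out.println(Integer.toHexString(decodeOne((byte) 0x99))); // 2122
    }
}
```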



Historical reason: in the ancient character encodings (EBCDIC, ASCII) the first 32 codes have special 'control' meaning and they may not map to readable characters. Examples: backspace, bell, carriage return. Newer character encoding standards usually inherit this and they don't define Unicode characters for every one of the first 32 positions. Java characters are Unicode.

