Java String.getBytes(charset) and new String(bytes, charset) with two different character sets

Question

As far as I know, in String.getBytes(charset), the argument, charset means that the method returns bytes of a string encoded as the given charset.

In new String(bytes, charset), the second argument, charset means that the method decodes bytes as the given charset and returns the decoded result.

According to the above, and as my understanding, the charset arguments of the two different methods must be the same so that new String(bytes, charset) can return a proper string. (I guess here is what I'm missing.)

I have an incorrectly decoded string, and I tested it with the following code:

// requires: import java.io.UnsupportedEncodingException;
String originalStr = "Å×½ºÆ®"; // 테스트
String[] charSet = {"utf-8", "euc-kr", "ksc5601", "iso-8859-1", "x-windows-949"};

// Encode with charSet[i], then decode with charSet[j], for every pair.
for (int i = 0; i < charSet.length; i++) {
    for (int j = 0; j < charSet.length; j++) {
        try {
            System.out.println("[" + charSet[i] + "," + charSet[j] + "] = "
                    + new String(originalStr.getBytes(charSet[i]), charSet[j]));
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
    }
}

The output is:

[utf-8,utf-8] = Å×½ºÆ®
[utf-8,euc-kr] = ��쩍쨘�짰
[utf-8,ksc5601] = ��쩍쨘�짰
[utf-8,iso-8859-1] = Ã…Ã—Â½ÂºÃ†Â®
[utf-8,x-windows-949] = 횇횞쩍쨘횈짰
[euc-kr,utf-8] = ?����������
[euc-kr,euc-kr] = ?×½ºÆ®
[euc-kr,ksc5601] = ?×½ºÆ®
[euc-kr,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[euc-kr,x-windows-949] = ?×½ºÆ®
[ksc5601,utf-8] = ?����������
[ksc5601,euc-kr] = ?×½ºÆ®
[ksc5601,ksc5601] = ?×½ºÆ®
[ksc5601,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[ksc5601,x-windows-949] = ?×½ºÆ®
[iso-8859-1,utf-8] = �׽�Ʈ
[iso-8859-1,euc-kr] = 테스트
[iso-8859-1,ksc5601] = 테스트
[iso-8859-1,iso-8859-1] = Å×½ºÆ®
[iso-8859-1,x-windows-949] = 테스트
[x-windows-949,utf-8] = ?����������
[x-windows-949,euc-kr] = ?×½ºÆ®
[x-windows-949,ksc5601] = ?×½ºÆ®
[x-windows-949,iso-8859-1] = ?¡¿¨ö¨¬¨¡¢ç
[x-windows-949,x-windows-949] = ?×½ºÆ®

As you can see, I figure out the way of getting the original string:

[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트  
[iso-8859-1,x-windows-949] = 테스트

How can it be possible? How can the string be encoded and decoded properly as different character sets?

Holger · Accepted Answer · 2019-03-15 12:54:36Z

According to the above, and as my understanding, the charset arguments of the two different methods must be the same so that new String(bytes, charset) can return a proper string.

That’s what you should aim for in order to write correct code. But this does not imply that every wrong operation will always produce wrong results. A simple example would be a string consisting of ASCII letters only. A lot of encodings produce the same byte sequence for such a string, so a test using only such a string is not sufficient to spot encoding-related errors.
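To make the ASCII point concrete, here is a minimal sketch. The class name AsciiOverlapDemo and the helper transcode are made up for illustration; only String.getBytes(Charset) and new String(byte[], Charset) come from the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiOverlapDemo {
    /** Encodes with one charset and decodes with another (hypothetical helper). */
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // ASCII-only input: the mismatched pair goes unnoticed.
        String ascii = "test";
        System.out.println(ascii.equals(
                transcode(ascii, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));

        // One non-ASCII character (e with acute accent) exposes the mismatch.
        String accented = "caf\u00E9";
        System.out.println(accented.equals(
                transcode(accented, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)));
    }
}
```

The first comparison succeeds even though the charsets differ, which is exactly why ASCII-only test data cannot catch this kind of bug.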

As you can see, I figure out the way of getting the original string:
[iso-8859-1,euc-kr] = 테스트  
[iso-8859-1,ksc5601] = 테스트  
[iso-8859-1,x-windows-949] = 테스트 
How can it be possible? How can the string be encoded and decoded properly as different character sets?

Well, when I execute

System.out.println(Charset.forName("euc-kr") == Charset.forName("ksc5601"));

on my machine, it prints true. Or, if I execute

System.out.println(Charset.forName("euc-kr").aliases());

it prints

[ksc5601-1987, csEUCKR, ksc5601_1987, ksc5601, 5601, euc_kr, ksc_5601, ks_c_5601-1987, euckr]

So for euc-kr and ksc5601, the answer is simple. These are different names for the same character encoding.

For x-windows-949, I have to resort to Wikipedia:

Unified Hangul Code (UHC), or Extended Wansung, also known under Microsoft Windows as Code Page 949 (Windows-949, MS949 or ambiguously CP949), is the Microsoft Windows code page for the Korean language. It is an extension of Wansung Code (KS C 5601:1987, encoded as EUC-KR) to include all 11172 Hangul syllables present in Johab (KS C 5601:1992 annex 3).

So it is an extension of ksc5601 which will lead to the same result, as long as you’re not using any characters affected by the extension (think of the ASCII example above).
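A rough way to see where the extension starts to matter is to compare the two charsets on a syllable from the common subset and on one that, as far as I know, exists only in the extension. This is a sketch: the class and method names are invented, it assumes a full JDK where EUC-KR and x-windows-949 are available, and the choice of U+B620 as an "extension only" syllable is my assumption.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class UhcExtensionDemo {
    /** True if both charsets produce the same bytes for the text. */
    static boolean sameBytes(String text, String charsetA, String charsetB) {
        return Arrays.equals(text.getBytes(Charset.forName(charsetA)),
                             text.getBytes(Charset.forName(charsetB)));
    }

    /** True if the charset can represent every character of the text. */
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String common = "\uD14C\uC2A4\uD2B8"; // the three syllables from the question
        String extended = "\uB620";           // assumed to be outside KS C 5601:1987

        System.out.println("common, same bytes: "
                + sameBytes(common, "EUC-KR", "x-windows-949"));
        System.out.println("extended, EUC-KR can encode: "
                + canEncode(extended, "EUC-KR"));
        System.out.println("extended, x-windows-949 can encode: "
                + canEncode(extended, "x-windows-949"));
    }
}
```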

Generally, this does not invalidate your premise. Correct results are only guaranteed when using the same encoding for both sides. It just means that testing code is much harder, as it requires sufficient test input data to spot errors. E.g. a common error in the Western world is to confuse iso-latin-1 (ISO 8859-1) with Windows codepage 1252, which may not get spotted with simple text.
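The ISO 8859-1 versus Windows-1252 confusion can be sketched the same way. The names below are hypothetical; the euro sign is used because Windows-1252 has it and ISO 8859-1 does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1VsCp1252Demo {
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** True if ISO 8859-1 and Windows-1252 produce the same bytes for the text. */
    static boolean encodesIdentically(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.ISO_8859_1),
                             text.getBytes(CP1252));
    }

    public static void main(String[] args) {
        // "Simple" Western text: the two encodings agree, so the bug stays hidden.
        System.out.println(encodesIdentically("caf\u00E9"));
        // The euro sign (U+20AC) only exists in Windows-1252, so the bytes differ.
        System.out.println(encodesIdentically("\u20AC"));
    }
}
```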

Thilo · Accepted Answer · 2019-03-15 10:11:31Z

Java strings are internally (at least in most cases...) stored as UTF-16.
The 256 characters in iso-8859-1 have the same code points as their Unicode equivalents.
I am assuming you compiled this code with some 8-bit source encoding, and your String literal ended up with all bits intact. Java thinks it has UTF-16 now, but it actually has junk characters, each of them in the range 0x00 to 0xFF.
When you ask Java to write its "UTF-16" out as iso-8859-1, it just writes out all these bytes directly (as the code points are shared). If you wrote it out as some other encoding, it would need to convert some of them. If you had any characters outside of the one-byte range, you would get a ? for them (as they cannot be expressed in iso-8859-1).
So your iso-8859-1 bytes are not iso-8859-1, but they still have your original bits.
When you read them back as iso-8859-1, the text will remain "junk".
But when you read them back using the Korean encoding that they actually represent, you get the proper text.
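The chain described above can be sketched as a small repair helper. MojibakeRepairDemo and reinterpret are made-up names, and the literal is written with Unicode escapes so that the example does not itself depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeRepairDemo {
    /**
     * Recovers the original bytes of a string that was wrongly decoded as
     * ISO 8859-1, then decodes them with the charset they were really in.
     */
    static String reinterpret(String misdecoded, Charset actualCharset) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actualCharset);
    }

    public static void main(String[] args) {
        // The six "junk" characters from the question, all in the range 0x00 to 0xFF.
        String junk = "\u00C5\u00D7\u00BD\u00BA\u00C6\u00AE";
        String repaired = reinterpret(junk, Charset.forName("EUC-KR"));
        // Print code points rather than glyphs, so the console encoding does not matter.
        repaired.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

This only works because every character of the junk string fits in one byte; it is a workaround for already damaged data, not a replacement for decoding with the right charset in the first place.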

"Your iso-8859-1 bytes are not iso-8859-1"

Well, if someone did want to write "Å×½ºÆ®" and used iso-8859-1 for it, they would get the exact same bytes you have. So in a way, it is still perfectly valid iso-8859-1. If it was not, Java would put in some ? for the characters that cannot exist in that encoding.
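For contrast, here is a sketch of what happens when the characters really cannot exist in iso-8859-1. The class and helper names are hypothetical; CharsetEncoder.canEncode is the JDK call used to check up front.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UnmappableDemo {
    /** Encodes and decodes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    /** True if every character of the text exists in the charset. */
    static boolean isEncodable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String latin = "\u00C5\u00D7\u00BD\u00BA\u00C6\u00AE"; // the junk string
        String korean = "\uD14C\uC2A4\uD2B8";                   // the intended text

        System.out.println(isEncodable(latin, StandardCharsets.ISO_8859_1));
        System.out.println(isEncodable(korean, StandardCharsets.ISO_8859_1));
        // getBytes silently substitutes '?' for each unmappable character.
        System.out.println(roundTrip(korean, StandardCharsets.ISO_8859_1));
    }
}
```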

Two things you can try:

Set your source code encoding to UTF-8. That should break things (because now it will not keep your bits intact anymore).
Set your editor to this Korean encoding. The String literal should look fine.

Tom Blodget · Accepted Answer · 2019-03-16 22:40:41Z

@Holger gave an excellent answer to the question as asked. The question is very well stated as a knowledge question that was arrived at during an investigation. Nonetheless, it does seem like an XY Problem.

How does "Å×½ºÆ®" represent "테스트"?

As already discovered, "Å×½ºÆ®" in ISO 8859-1 is the same byte sequence as "테스트" in a few character encodings for the Hangul script:

C5 D7 BD BA C6 AE
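A quick way to check that byte sequence yourself is to dump both strings as hex. This is just a sketch; HexDumpDemo and hex are invented names, and EUC-KR is assumed to be available in your JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HexDumpDemo {
    /** Encodes the text and formats the bytes as space-separated hex pairs. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String latin = "\u00C5\u00D7\u00BD\u00BA\u00C6\u00AE";
        String korean = "\uD14C\uC2A4\uD2B8";
        System.out.println(hex(latin, StandardCharsets.ISO_8859_1));
        System.out.println(hex(korean, Charset.forName("EUC-KR")));
    }
}
```

Both lines should show the same six bytes, which is the whole coincidence behind the question.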

There is no text but encoded text.

When communicating text, one must send the bytes along with an understanding of which character encoding was used. So, to communicate 테스트, one would send the bytes C5 D7 BD BA C6 AE along with the understanding that they represent text encoded with, say, Windows-949. This is apparently not what was done.

Sometimes, when a sequence of bytes needs to be handled in a text datatype, a byte-to-character scheme is used. One is Base64. It takes three bytes at a time and represents them with four characters. When communicating such a usage, one must send both the string and an understanding that Base64 is being used and of what the bytes are supposed to represent.
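For reference, this is roughly what the Base64 route looks like with java.util.Base64. The wrapper class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64Demo {
    /** Represents arbitrary bytes as ASCII-only text. */
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    /** Recovers the bytes from their Base64 text form. */
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC5, (byte) 0xD7, (byte) 0xBD,
                       (byte) 0xBA, (byte) 0xC6, (byte) 0xAE};
        String text = toText(data);
        // Six bytes become eight characters: four characters per three bytes.
        System.out.println(text + " (" + text.length() + " chars)");
        System.out.println(Arrays.equals(data, fromText(text)));
    }
}
```

The receiver still has to be told separately that the decoded bytes are, say, Windows-949 text.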

Sometimes, when Base64 is considered wasteful and its property of using only a limited set of printable characters that are present in almost every character set is not valued, a more compact scheme is used. I call it Base256. It takes one byte at a time and represents it with one character. It uses the same mapping as the ISO 8859-1 character encoding.
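A sketch of that "Base256" idea, to show that it is lossless for every possible byte value. The names are made up, and Java's ISO_8859_1 charset stands in for the mapping.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Base256Demo {
    /** One byte in, one character out (code points U+0000 to U+00FF). */
    static String toText(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    /** One character in, one byte out; the inverse of toText. */
    static byte[] fromText(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }
        String text = toText(all);
        System.out.println(text.length());                      // same count as the bytes
        System.out.println(Arrays.equals(all, fromText(text))); // lossless round trip
    }
}
```

Unlike Base64 output, the resulting string contains control and non-printable characters, so it looks like ordinary (but wrong) text whenever someone displays it.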

Putting this all together there was a communication failure. The following metadata was missing:

The string "Å×½ºÆ®" represents a byte sequence that can be obtained by "encoding" with ISO 8859-1.
That byte sequence represents text that is encoded with, say, Windows-949.

(I think Base256 is too novel to be productive. Unfortunately, it is not uncommon. Hopefully, it will fall from use.)

Erwin Bolwidt · Accepted Answer · 2019-03-22 02:52:46Z

Your problem is that the initial assumption in your code is incorrect.

You say:

String originalStr = "Å×½ºÆ®"; // 테스트

which is simply not true.

The only correct line is

String originalStr = "테스트"; // 테스트

Your originalStr did not contain the characters 테스트. You just found an encoding pair that, when given the input string Å×½ºÆ®, sends bytes to your terminal, which uses a particular character encoding that you didn't mention, and those bytes end up being shown as 테스트.

Fixes: always use a fixed character encoding for your Java source code. The easiest way is to specify it in your pom.xml with:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

(or the equivalent for other build systems) and use an IDE that understands Maven.

Otherwise, you need to make sure that your IDE or editor uses the same character encoding as the one used when compiling your source code. Alternatively, you can stick to Unicode \u escape sequences for all non-ASCII characters.
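As an illustration of the escape-only approach, the literal can be written so that the source file stays pure ASCII. The class name is hypothetical; the three code points are those of the intended Korean text.

```java
public class UnicodeEscapeDemo {
    // The intended three-syllable text, spelled with Unicode escapes so the
    // compiler reads the same characters whatever the source file encoding is.
    static final String ORIGINAL = "\uD14C\uC2A4\uD2B8";

    public static void main(String[] args) {
        // Print code points instead of glyphs to stay independent of the console.
        ORIGINAL.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

The price is readability, which is why fixing the source encoding in the build is usually the nicer option.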

Once you have that set up, you'll notice that, for the input:

String originalStr = "테스트";

the only encoding pairs that give back the same output are the ones that support Korean characters and use the same input and output encoding (or encodings that are merely aliases for each other, such as euc-kr and ksc5601). To check, print both strings to your console and compare them, or ensure that your console uses the same character set as your Java default character set.
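Rather than eyeballing console output, the comparison can be done in code. This is a sketch with invented names; it assumes the listed charsets are available in your JDK and uses escapes for the literal so the source encoding cannot interfere.

```java
import java.nio.charset.Charset;

public class RoundTripMatrix {
    /** True if encoding with one charset and decoding with the other restores the text. */
    static boolean roundTrips(String text, String encodeWith, String decodeWith) {
        byte[] bytes = text.getBytes(Charset.forName(encodeWith));
        return new String(bytes, Charset.forName(decodeWith)).equals(text);
    }

    public static void main(String[] args) {
        String original = "\uD14C\uC2A4\uD2B8"; // the intended Korean text
        String[] names = {"utf-8", "euc-kr", "iso-8859-1", "x-windows-949"};
        for (String enc : names) {
            for (String dec : names) {
                System.out.println("[" + enc + "," + dec + "] round-trips: "
                        + roundTrips(original, enc, dec));
            }
        }
    }
}
```

Comparing with equals sidesteps the console entirely, so the result does not depend on how your terminal renders the characters.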

vavasthi · Accepted Answer · 2019-03-15 05:58:38Z

UTF-8 is a variable-width encoding. Its first 128 code points match ASCII, which covers English text in one byte each. Beyond that range, a character in any language is encoded in up to four bytes.

By comparison, many other character sets are fixed-width, often with two bytes per character. Because of this, you will see overlap when you map a byte stream from one character set into another. For example, the English character 'A' is represented as 0x41 in UTF-8 and as 0x0041 in UTF-16. So if you take a UTF-16 (big-endian) byte stream and try to decode it as UTF-8, you will find two characters: one NUL and then an 'A'.
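Both claims can be checked with a few lines. The class and method names are hypothetical, and UTF-16BE is my reading of "unicode" in the example above.

```java
import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {
    /** Number of bytes the text occupies in UTF-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Encodes as big-endian UTF-16, then (wrongly) decodes as UTF-8. */
    static String utf16ReadAsUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_16BE), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // ASCII letter
        System.out.println(utf8Length("\u00E9"));       // Latin letter with accent
        System.out.println(utf8Length("\uD14C"));       // Hangul syllable
        System.out.println(utf8Length("\uD83D\uDE00")); // emoji outside the BMP

        String misread = utf16ReadAsUtf8("A");
        // Two characters come back: NUL (U+0000) followed by 'A'.
        misread.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```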

1 Comment

ParkCheolu Over a year ago

That's not what I asked. I asked why the string could be decoded properly despite the difference between the character sets given to the getBytes method and the String constructor.
