I thought the length would always stay the same when converting between byte[] and String, but the example below shows this is incorrect.

byte[] b1 = {55, -71, -35, -35, 83, -115, 107, -80, -62, 86, 98, 125, -68, -12, 14, -92, -122, -65, -117, -26, 80, -102, 75, 49, -120, -10, 18, -8, 82, -21, 49, 80, 125, 94, -35, -66, 91, 79, 77, -29, -48, -85, 29, -48, -118, -13, -84, -77, 93, -101, -7, 46, -44, -25, -42, 72, -33, -81, -120, -40, 40, 65, 58, -74, -34, 99, -8, -118, 83, 110, -94, 69, 21, -27, 114, 43, -23, 7, 120, -15, 21, 110, 108, 98, -99, 7, 107, 63, -48, 32, 123, 35, -36, -35, 7, -75, 40, -3, 33, 92, -79, 119, 22, -63, 27, 123, -98, 92, -93, 30, 51, 55, 106, -109, 99, 123, 25, -111, -53, 66, 117, 121, -20, 6, -10, -34, -76, -120, -56, 123, 48, -9, -116, -81, -47, 67, 80, 14, -58, -17, -92, -75, 119, 27, 125, -115, -31, 114, -96, 126, -87, 98, -108, -21, -113, 36, 104, -69, -74, 41, -68, 115, 103, 106, -39, 10, 0, 7, -66, 84, -94, 46, -1, -62, -115, 104, -104, 53, 86, -117, 15, -100, 46, 7, 57, -84, 40, 118, -12, 93, -6, -31, 28, 81, -72, 123, 54, -76, 123, 111, 54, 121, 126, -19, -32, 99, 109, -68, -103, 29, 75, 57, 115, 33, 110, -23, -116, 11, 112, 117, 67, -100, 21, 94, -16, 94, 24, 47, -90, -48, 30, 15, 24, 98, -114, -96, 37, -47, 32, 74, 110, 58, 35, 77, 62, -74, 94, 59, 63, -35, -59, 10, 43, 65, -63, 59, -65, 58, 69, 88, -91, -58, -103, 88, 6, -105, 92, -9, -19, 26, 5, -42, -38, -82, -56, 42, -45, 30, 103, -113, -64, -82, 29, 6, 40, 102, 44, 59, 51, -69, -70, 90, -126, 40, -105, 103, 92, 124, 120, 43, -53, 73, -109, 103, -62, -64, -68, -81, -61, -68, -73, -6, -112, 85, 119, -92, -85, -31, -37, 32, -2, 100, 34, 41, -128, 73, -92, -94, 71, 98, 0, 126, -98, -51, -8, -72, -97, 66, -71, -14, -74, -39, 56, 71, 46, -94, 40, 32, -84, -17, -128, 60, 25, 75, -104, 25, 49, -14, -103, -89, 97, -61, 89, -109, 118, 114, 123, -38, 101, 98, 7, 70, 9, 42, 98, -94, 73, -70, 72, 43, 52, -89, -20, -22, -58, -109, -88, 36, 118, 71, -34, -85, -24, -46, -120, -118, 5, -118, -53, -5, -87, -116, -38, 101, 74, -111, -2, 12, 48, -105, -110, 6, -114, 31, 70, -42, -118, -61, 82, 83, -37, 27, -56, 91, 113, -23, -40, 
-121, 35, 79, 3, 79, 58, -54, -11, -41, -48, -109, -54, 96, 80, 77, -69, -88, -75, -126, -64, 54, 33, 7, 121, 16, -49, 26, 68, 94, 107, -79, -17, -67, -59, 57, -8, -36, 99, 29, -2, 36, -91, 70, 56, 76, 88, 40, 85, -16, 120, -101, -21, 83, 103, -91, 28, 14, 17, 73, -102, -121, 69, -102, 18, -115, -92, -5, -50, -20};
System.out.println("length of b1 = " + b1.length);

String s = new String(b1, "utf-8");
System.out.println("length of s = " + s.length());

byte[] b2 = s.getBytes("utf-8");
System.out.println("length of b2 = " + b2.length);

Running this, I got the following output:

length of b1 = 496
length of s = 470
length of b2 = 877

Why are they so different?

4 Comments
  • Maybe the input is not UTF-8? Maybe ASCII? Commented Mar 23, 2018 at 9:28
  • The simple answer: wrong charset. Test with "ISO-8859-1". As for why UTF-8 reduces the size, that is a good question; I don't know how the UTF-8 decoder works. Commented Mar 23, 2018 at 9:51
  • Yes, it works. Thanks DavidIbl, AxelH. When I change to "ISO-8859-1" the lengths become the same. But why does only "ISO-8859-1" work? I also tested with "ascii" and got wrong lengths. Maybe I should ask which charset should be specified when storing random binary data in a String? Commented Mar 23, 2018 at 10:21
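  • @LinaRalph ASCII will not work, as ASCII is only defined for the range 0..127, and your byte[] also contains values outside that range (the negative bytes, 128..255 when read as unsigned). ISO-8859-1 works because it maps every one of the 256 byte values to a character; other single-byte encodings should behave the same as long as they define all 256 values. Commented Mar 23, 2018 at 12:07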
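
On the question of how to keep arbitrary binary data in a String, two lossless options can be sketched as follows. This is only an illustration: the class and method names are made up here, and it assumes a charset that maps all 256 byte values (ISO-8859-1) or a text encoding designed for binary data (Base64 from java.util).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryInStringDemo {
    // ISO-8859-1 maps each byte value 0..255 to exactly one char.
    static String toLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Base64 produces plain ASCII text that survives any charset.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Every possible byte value, so nothing is left untested.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        String latin1 = toLatin1(all);
        System.out.println("ISO-8859-1 chars = " + latin1.length());
        System.out.println("ISO-8859-1 round trip ok = "
                + Arrays.equals(all, fromLatin1(latin1)));

        String base64 = toBase64(all);
        System.out.println("Base64 round trip ok = "
                + Arrays.equals(all, fromBase64(base64)));
    }
}
```

Base64 makes the text about a third longer than the data, but the result contains only printable ASCII, which is usually the safer choice when the String is stored or sent somewhere else.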

1 Answer

In UTF-8, a single character may be encoded as more than 1 byte.

Example:

Character -> Codepoints -> UTF-8 Encoding
ä         -> 00E4       -> C3 A4

So 2 bytes in the input can be decoded to 1 character in the output.
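
The ä example can be reproduced with a few lines of Java. This is a minimal sketch; the class and method names are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class MultiByteDemo {
    // Decode bytes that are assumed to be UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Number of bytes the text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] encoded = {(byte) 0xC3, (byte) 0xA4}; // UTF-8 for U+00E4
        String decoded = decodeUtf8(encoded);

        System.out.println("bytes = " + encoded.length);            // 2
        System.out.println("chars = " + decoded.length());          // 1
        System.out.println("bytes again = " + utf8Length(decoded)); // 2
    }
}
```

Because C3 A4 is a valid sequence, the round trip here gives back the original 2 bytes; only the String in the middle is shorter.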

Now in Unicode you can decompose characters (especially foreign languages). So to keep my example the character ä can be decomposed to

¨a

This are now 2 characters that have the following encodings

Character -> Codepoints -> UTF-8 Encoding
¨a        -> 00A4 0061  -> C2 A4 61

Especially if you use asian languages this decomposing takes place more often then in this example.

So for this example (and when the decomposing takes place, which is not for sure in every language) you would have the following output of your program:

length of b1 = 2
length of s = 1
length of b2 = 3

I think that can explain your findings.
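
The round trip from the question can be tried on very small inputs to see where the extra bytes come from. A minimal sketch (class and method names are hypothetical): it decodes a lone byte -71 (0xB9), which is not a valid UTF-8 sequence by itself, and encodes the result again. It uses the Charset overloads, whose handling of malformed input is documented as replacement.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Same round trip as in the question: bytes -> String -> bytes.
    static byte[] roundTrip(byte[] input) {
        String s = new String(input, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] b1 = {-71}; // 0xB9: a continuation byte without a lead byte
        String s = new String(b1, StandardCharsets.UTF_8);
        byte[] b2 = roundTrip(b1);

        System.out.println("length of b1 = " + b1.length);         // 1
        System.out.println("length of s = " + s.length());         // 1
        System.out.println("s is U+FFFD = " + s.equals("\uFFFD")); // true
        System.out.println("length of b2 = " + b2.length);         // 3
    }
}
```

The original value 0xB9 is gone after the first conversion, so no second conversion can bring it back.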


2 Comments

Thanks! Could you explain why, although I specify the same charset both times, it can't convert back to the original byte array b1? Is this caused by some data going missing in the conversions? String s = new String(b1, "utf-8"); byte[] b2 = s.getBytes("utf-8");
@LinaRalph Yes, data is lost in the first conversion. Bytes in your original byte[] that do not form a valid UTF-8 sequence are decoded to the replacement character U+FFFD, so their original values are gone, and when converting back each U+FFFD is written as 3 bytes (EF BF BD).
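
Whether a byte array will survive the UTF-8 round trip can be checked up front with a strict decoder that reports malformed input instead of silently replacing it. A minimal sketch using java.nio.charset.CharsetDecoder; the class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Demo {
    // True only if the whole array is well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown at the first malformed sequence.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xC3, (byte) 0xA4}; // UTF-8 for U+00E4
        byte[] random = {55, -71, -35, -35};       // first bytes of b1

        System.out.println("valid  -> " + isValidUtf8(valid));  // true
        System.out.println("random -> " + isValidUtf8(random)); // false
    }
}
```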
