Issue: length changes when converting between String and byte array?

Question

I thought the length of the output would always stay the same when converting between byte[] and String, but the example below shows this is incorrect.

byte[] b1 = {55, -71, -35, -35, 83, -115, 107, -80, -62, 86, 98, 125, -68, -12, 14, -92, -122, -65, -117, -26, 80, -102, 75, 49, -120, -10, 18, -8, 82, -21, 49, 80, 125, 94, -35, -66, 91, 79, 77, -29, -48, -85, 29, -48, -118, -13, -84, -77, 93, -101, -7, 46, -44, -25, -42, 72, -33, -81, -120, -40, 40, 65, 58, -74, -34, 99, -8, -118, 83, 110, -94, 69, 21, -27, 114, 43, -23, 7, 120, -15, 21, 110, 108, 98, -99, 7, 107, 63, -48, 32, 123, 35, -36, -35, 7, -75, 40, -3, 33, 92, -79, 119, 22, -63, 27, 123, -98, 92, -93, 30, 51, 55, 106, -109, 99, 123, 25, -111, -53, 66, 117, 121, -20, 6, -10, -34, -76, -120, -56, 123, 48, -9, -116, -81, -47, 67, 80, 14, -58, -17, -92, -75, 119, 27, 125, -115, -31, 114, -96, 126, -87, 98, -108, -21, -113, 36, 104, -69, -74, 41, -68, 115, 103, 106, -39, 10, 0, 7, -66, 84, -94, 46, -1, -62, -115, 104, -104, 53, 86, -117, 15, -100, 46, 7, 57, -84, 40, 118, -12, 93, -6, -31, 28, 81, -72, 123, 54, -76, 123, 111, 54, 121, 126, -19, -32, 99, 109, -68, -103, 29, 75, 57, 115, 33, 110, -23, -116, 11, 112, 117, 67, -100, 21, 94, -16, 94, 24, 47, -90, -48, 30, 15, 24, 98, -114, -96, 37, -47, 32, 74, 110, 58, 35, 77, 62, -74, 94, 59, 63, -35, -59, 10, 43, 65, -63, 59, -65, 58, 69, 88, -91, -58, -103, 88, 6, -105, 92, -9, -19, 26, 5, -42, -38, -82, -56, 42, -45, 30, 103, -113, -64, -82, 29, 6, 40, 102, 44, 59, 51, -69, -70, 90, -126, 40, -105, 103, 92, 124, 120, 43, -53, 73, -109, 103, -62, -64, -68, -81, -61, -68, -73, -6, -112, 85, 119, -92, -85, -31, -37, 32, -2, 100, 34, 41, -128, 73, -92, -94, 71, 98, 0, 126, -98, -51, -8, -72, -97, 66, -71, -14, -74, -39, 56, 71, 46, -94, 40, 32, -84, -17, -128, 60, 25, 75, -104, 25, 49, -14, -103, -89, 97, -61, 89, -109, 118, 114, 123, -38, 101, 98, 7, 70, 9, 42, 98, -94, 73, -70, 72, 43, 52, -89, -20, -22, -58, -109, -88, 36, 118, 71, -34, -85, -24, -46, -120, -118, 5, -118, -53, -5, -87, -116, -38, 101, 74, -111, -2, 12, 48, -105, -110, 6, -114, 31, 70, -42, -118, -61, 82, 83, -37, 27, -56, 91, 113, -23, -40, 
-121, 35, 79, 3, 79, 58, -54, -11, -41, -48, -109, -54, 96, 80, 77, -69, -88, -75, -126, -64, 54, 33, 7, 121, 16, -49, 26, 68, 94, 107, -79, -17, -67, -59, 57, -8, -36, 99, 29, -2, 36, -91, 70, 56, 76, 88, 40, 85, -16, 120, -101, -21, 83, 103, -91, 28, 14, 17, 73, -102, -121, 69, -102, 18, -115, -92, -5, -50, -20};
System.out.println("length of b1 = " + b1.length);

// Decode the raw bytes as UTF-8 text
String s = new String(b1, "utf-8");
System.out.println("length of s = " + s.length());

// Encode the String back to UTF-8 bytes
byte[] b2 = s.getBytes("utf-8");
System.out.println("length of b2 = " + b2.length);

By running this, I got output:

length of b1 = 496
length of s = 470
length of b2 = 877

Why are they so different?

For the simple answer: wrong charset. Test with "ISO-8859-1". As for why UTF-8 reduces the size, that is a good question... I don't know how the UTF-8 decoder works.
– AxelH, Commented Mar 23, 2018 at 9:51
Yes, it works. Thanks DavidIbl, AxelH. When I change to "ISO-8859-1" the lengths become the same. But why does only "ISO-8859-1" work? I also tested with "ascii" and got wrong lengths. Maybe I should ask which charset should be specified when storing random binary data in a String?
– LinaRalph, Commented Mar 23, 2018 at 10:21
@LinaRalph ASCII will not work because ASCII is only defined for the range 0..127, and your byte[] also contains values outside that range. Other encodings (everything ISO-8859-* or any single-byte encoding) should work the same.
– Uwe Plonus, Commented Mar 23, 2018 at 12:07
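To make the charset question in these comments concrete, here is a minimal sketch of two ways to carry arbitrary bytes through a String without losing anything. The class and method names (BinaryRoundTrip, viaLatin1, viaBase64) are made up for illustration. It relies on Java's ISO-8859-1 mapping every byte value 0..255 to exactly one char, and on Base64, which exists for exactly this purpose.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryRoundTrip {
    // ISO-8859-1 maps each of the 256 byte values to exactly one char,
    // so decoding and re-encoding returns the same bytes.
    static byte[] viaLatin1(byte[] data) {
        String s = new String(data, StandardCharsets.ISO_8859_1);
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Base64 turns arbitrary bytes into plain ASCII text and back.
    static byte[] viaBase64(byte[] data) {
        String s = Base64.getEncoder().encodeToString(data);
        return Base64.getDecoder().decode(s);
    }

    public static void main(String[] args) {
        // First few bytes of b1 from the question
        byte[] data = {55, -71, -35, -35, 83, -115, 107, -80};
        System.out.println(Arrays.equals(data, viaLatin1(data))); // true
        System.out.println(Arrays.equals(data, viaBase64(data))); // true
    }
}
```

Base64 is usually the safer choice if the String is going to be stored or transmitted, since the ISO-8859-1 variant can contain control characters.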

Uwe Plonus · Accepted Answer · 2018-03-23 13:03:08Z

1

In UTF-8, a single character may be encoded as more than 1 byte.

Example:

Character -> Codepoints -> UTF-8 Encoding
ä         -> 00E4       -> C3 A4

So 2 bytes in the input can be decoded as 1 character in the output.
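A small sketch of this in Java, using the two bytes C3 A4 from the table above. The class name MultiByteDemo and its decode helper are only for illustration.

```java
import java.nio.charset.StandardCharsets;

class MultiByteDemo {
    // Decode raw bytes as UTF-8 text.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA4}; // UTF-8 encoding of U+00E4
        String s = decode(bytes);
        System.out.println("bytes = " + bytes.length); // bytes = 2
        System.out.println("chars = " + s.length());   // chars = 1
    }
}
```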

Unicode also lets you decompose some characters (especially in non-English languages). Staying with my example, the character ä can be decomposed into the base letter a followed by the combining diaeresis (U+0308):

a + U+0308

These are now 2 code points with the following encodings

Character  -> Codepoints -> UTF-8 Encoding
a + U+0308 -> 0061 0308  -> 61 CC 88
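In Java, decomposition can be applied explicitly with java.text.Normalizer. The sketch below (the class name DecomposeDemo is made up) shows that the canonical decomposition (NFD) of ä is the letter a followed by the combining diaeresis U+0308, which takes 3 bytes in UTF-8. As far as I know, decomposition only happens when something asks for it like this.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class DecomposeDemo {
    // Apply canonical decomposition (NFD) to a string.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00E4";             // precomposed a-umlaut, 1 char
        String decomposed = decompose(composed); // 'a' followed by U+0308
        System.out.println(composed.length());   // 1
        System.out.println(decomposed.length()); // 2
        // 61 CC 88
        System.out.println(decomposed.getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```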

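In some languages and scripts this decomposition occurs far more often than in this example.

So for this example (and only if the decomposition actually takes place, which is not guaranteed) you would have the following output of your program:

length of b1 = 2
length of s = 1
length of b2 = 3

Note that new String(byte[], charset) does not decompose characters by itself. With arbitrary binary data such as your array there is a second effect: byte sequences that are not valid UTF-8 are replaced by the replacement character U+FFFD, which is 1 char in the String but 3 bytes (EF BF BD) when encoded back to UTF-8. Valid multi-byte sequences reduce the number of chars, and the replacement characters increase the number of bytes on the way back.

I think that can explain your findings.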
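For random binary input like the array in the question, one more effect is worth a sketch: new String(byte[], Charset) replaces every malformed byte sequence with the replacement character U+FFFD, and that character takes 3 bytes (EF BF BD) in UTF-8. The example below uses a hand-picked 4-byte input, and the class name ReplacementDemo is only for illustration. It reproduces the same shrink-then-grow pattern on a small scale.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Decode as UTF-8; malformed input becomes U+FFFD.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    // Decode and encode again, as in the question.
    static byte[] roundTrip(byte[] data) {
        return decode(data).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // C3 A4 is a valid 2-byte sequence, 37 is ASCII '7',
        // B9 (-71) is a lone continuation byte and therefore malformed.
        byte[] b1 = {(byte) 0xC3, (byte) 0xA4, 55, -71};
        String s = decode(b1);
        byte[] b2 = roundTrip(b1);
        System.out.println("length of b1 = " + b1.length);  // length of b1 = 4
        System.out.println("length of s = " + s.length());  // length of s = 3
        System.out.println("length of b2 = " + b2.length);  // length of b2 = 6
    }
}
```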

edited Mar 23, 2018 at 13:03

answered Mar 23, 2018 at 10:19


2 Comments

LinaRalph Over a year ago

Thanks! Could you explain why, although I specify the same charset, it can't convert back to the original byte array b1? Is this caused by some data being lost in the conversion? String s = new String(b1, "utf-8"); byte[] b2 = s.getBytes("utf-8");

Uwe Plonus Over a year ago

@LinaRalph The round trip is lossy: your original byte[] contains sequences that do not map one-to-one onto UTF-8 characters, so decoding changes them, and converting back produces a different (here longer) sequence of bytes, as I tried to explain in my answer.
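If you want to find out up front whether a byte[] will survive the UTF-8 round trip, one option is a strict decoder that reports malformed input instead of silently replacing it. This is a sketch only, and the names StrictDecode and isValidUtf8 are made up:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecode {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder hit a sequence that is not valid UTF-8.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC3, (byte) 0xA4})); // true
        System.out.println(isValidUtf8(new byte[] {55, -71}));                  // false
    }
}
```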
