Weird encoding output for strings of the same length

Question

I ran into something tricky and can't understand exactly how it happens.

Why can strings that each contain a single character produce byte arrays of different lengths?

Code:

import java.util.Arrays;

public class Application {
    public static void main(String[] args) throws Exception {
        // Each string below holds exactly one char, yet the UTF-8 byte count differs.
        char ch;
        ch = 0x0001;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        ch = 0x0111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        ch = 0x1111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
    }
}

The output is:

[1]
[-60, -111]
[-31, -124, -111]

Why exactly does this happen?

Tim Pietzcker · Accepted Answer · 2013-11-23 13:40:16Z

That's how UTF-8 works. Code points between 0 and 127 are encoded as single bytes (to maintain ASCII compatibility); code points above that are encoded as two to four bytes. (The original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 restricted it to four.)

[Screenshot: the UTF-8 encoding table from Wikipedia]
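The length rule can be checked directly against the JDK's own encoder. The following is a minimal sketch, not part of the original answer; the class name `Utf8Lengths` and the helper `utf8Length` are made up for illustration, and the sample code points are chosen to sit on either side of each length boundary.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Hypothetical helper: number of bytes the JDK's UTF-8 encoder
    // produces for a single (non-surrogate) code point.
    public static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Samples around the boundaries U+007F/U+0080, U+07FF/U+0800, U+FFFF/U+10000.
        int[] samples = {0x0001, 0x007F, 0x0080, 0x0111, 0x07FF,
                         0x0800, 0x1111, 0xFFFF, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The three values from the question (0x0001, 0x0111, 0x1111) fall into the one-, two- and three-byte ranges respectively, which matches the array lengths printed there.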

So, for your examples:

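0x0001 (0b00000001) is encoded as
(bin) 00000001 = (dec) 1
0x0111 (0b00000001 00010001) is encoded as
(bin) 11000100 10010001 = (dec) -60 -111
0x1111 (0b00010001 00010001) is encoded as
(bin) 11100001 10000100 10010001 = (dec) -31 -124 -111

The values print as negative because Java's byte type is signed: any byte with its top bit set (128 or more when read as unsigned) is shown as that value minus 256, e.g. 0b11000100 = 196 prints as -60.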
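The bit shuffling in these examples can be reproduced by hand. Here is a rough sketch, assuming only code points up to U+FFFF and ignoring surrogates and four-byte sequences; `ManualUtf8` and `encode` are illustrative names, not a JDK API. It packs the payload bits behind the `110`/`1110` lead-byte prefixes and the `10` continuation prefix, then compares the result with what `getBytes` returns.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ManualUtf8 {
    // Hypothetical helper: encodes one code point in the range U+0000..U+FFFF.
    // Surrogates and code points above U+FFFF are deliberately not handled.
    public static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        char[] samples = {0x0001, 0x0111, 0x1111};
        for (char ch : samples) {
            byte[] manual = encode(ch);
            byte[] jdk = String.valueOf(ch).getBytes(StandardCharsets.UTF_8);
            // Print the hand-built bytes and whether they match the JDK encoder.
            System.out.println(Arrays.toString(manual) + " matches JDK: " + Arrays.equals(manual, jdk));
        }
    }
}
```

For the three characters in the question this yields `[1]`, `[-60, -111]` and `[-31, -124, -111]`, the same arrays as the original program.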

edited Nov 23, 2013 at 13:40

answered Nov 23, 2013 at 12:07

Tim Pietzcker