I ran into something tricky and can't understand exactly how it happens.

Why can a string that contains a single character produce byte arrays of different lengths?

Code:

import java.util.Arrays;

public class Application {
    public static void main(String[] args) throws Exception {

        char ch;
        // Code point in the ASCII range
        ch = 0x0001;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        // Code point above 0x007F
        ch = 0x0111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
        // Code point above 0x07FF
        ch = 0x1111;
        System.out.println(Arrays.toString(("" + ch).getBytes("UTF-8")));
    }
}

The output is:

[1]
[-60, -111]
[-31, -124, -111]

Why exactly does this happen?

1 Answer

That's how UTF-8 works. Code points between 0 and 127 are encoded as a single byte (to maintain ASCII compatibility); code points above that are encoded as two to four bytes. (The original UTF-8 design allowed sequences of up to six bytes, but RFC 3629 limits it to four.)
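As a rough illustration of those ranges, the sketch below asks the JDK how many bytes it produces for a few sample code points. The class name `Utf8Lengths` and the helper `utf8Length` are made up for this example; the sample values are arbitrary picks around the range boundaries and deliberately avoid surrogates.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Hypothetical helper: number of bytes UTF-8 needs for one code point.
    // Assumes codePoint is a valid, non-surrogate Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x0001, 0x007F, 0x0080, 0x0111, 0x07FF,
                         0x0800, 0x1111, 0x10000, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp));
        }
    }
}
```

The byte count changes at 0x0080, 0x0800, and 0x10000, which is why the three characters in the question produce arrays of length 1, 2, and 3.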

[Image: UTF-8 byte-layout table, screenshot from Wikipedia]

So, for your examples:

  1. 0x0001 (0b00000001) is encoded as
    (bin) 00000001 = (dec) 1
  2. 0x0111 (0b00000001 00010001) is encoded as
    (bin) 11000100 10010001 = (dec) -60 -111
  3. 0x1111 (0b00010001 00010001) is encoded as
    (bin) 11100001 10000100 10010001 = (dec) -31 -124 -111
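
To make the bit layout concrete, here is a minimal hand-rolled encoder for a single `char`, compared against the JDK's own result. It is only a sketch: the class `ManualUtf8` and method `encode` are invented for this example, and it assumes the `char` is not a surrogate (it covers the one- to three-byte forms only).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ManualUtf8 {
    // Hypothetical helper: encodes one BMP char by hand.
    // Assumes ch is not a surrogate (0xD800..0xDFFF).
    static byte[] encode(char ch) {
        if (ch < 0x80) {
            // 0xxxxxxx
            return new byte[] { (byte) ch };
        } else if (ch < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (ch >> 6)),
                (byte) (0x80 | (ch & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (ch >> 12)),
                (byte) (0x80 | ((ch >> 6) & 0x3F)),
                (byte) (0x80 | (ch & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        for (char ch : new char[] { 0x0001, 0x0111, 0x1111 }) {
            byte[] manual = encode(ch);
            byte[] builtIn = String.valueOf(ch).getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(manual)
                    + " matches built-in: " + Arrays.equals(manual, builtIn));
        }
    }
}
```

The values print as negative numbers because Java's `byte` is signed: any byte with the high bit set, such as 11000100 (196 unsigned), is shown as 196 - 256 = -60. Masking with `b & 0xFF` gives the unsigned value.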