JAVA: get UTF-8 Hex values from a string?

Question

I would like to convert a UTF-8 string to a hex string. In the example below I've created a sample string containing two letters. I then try to print the hex values, but I get negative values.

How can I make it give me 05D0 and 05D1?

String a = "\u05D0\u05D1";
byte[] xxx = a.getBytes("UTF-8");

for (byte x : xxx) {
   System.out.println(Integer.toHexString(x));
}

Thank you.

ataylor · Accepted Answer · 2012-03-14 17:33:12Z

6

Don't convert to an encoding like UTF-8 if you want the code point. Use Character.codePointAt.

For example:

Character.codePointAt("\u05D0\u05D1", 0) // returns 1488, or 0x5d0
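To get both values the question asks for, the same idea can be applied to every character in the string. The following is a minimal sketch, not part of the original answer; the class name CodePointHex and the helper codePointsToHex are made up for illustration.

```java
public class CodePointHex {
    // Hypothetical helper (not from the original answer): formats every
    // code point of s as at least four uppercase hex digits, separated by spaces.
    static String codePointsToHex(String s) {
        StringBuilder sb = new StringBuilder();
        // Step by code point rather than by char so that surrogate pairs
        // are reported as one value.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(codePointsToHex("\u05D0\u05D1")); // prints 05D0 05D1
    }
}
```

No encoding step is involved here: the values come straight from the string's code points.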


3 Comments

ataylor Over a year ago

Well, do you want the hex values of the UTF-8 (0xD790) or the code point (0x000005D0)? If you want the code point, convert the bytes to a string with new String(bytes, "UTF-8") and then use Character.codePointAt(...).toHexString() to get the hex representation.

thedp Over a year ago

Maybe I'm missing something. Character.codePointAt doesn't have a toHexString method; it returns an integer. Can you please give me a complete example? Thanks

ataylor Over a year ago

Oops, toHexString is a static method. System.out.println(Integer.toHexString(Character.codePointAt("\u05D0", 0))) will print out 5d0. If you want to pad it with zeros on the left, try System.out.printf("%08x", Character.codePointAt("\u05D0", 0)) which prints 000005d0.

Malcolm · Accepted Answer · 2012-03-14 17:17:18Z

3

Negative values occur because the range of byte is from -128 to 127. The following code will produce positive values:

String a = "\u05D0\u05D1";
byte[] xxx = a.getBytes("UTF-8");

for (byte x : xxx) {
    System.out.println(Integer.toHexString(x & 0xFF));
}

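The only difference is that it passes x & 0xFF instead of x. A byte is signed, so a value such as 0xD7 is sign-extended to the int 0xFFFFFFD7 when it is passed to Integer.toHexString. Masking with 0xFF keeps only the low eight bits and yields a non-negative int.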
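The effect of the mask can be seen on a single byte. This is a small sketch for illustration only; the class name ByteHex and the two helper methods are hypothetical.

```java
public class ByteHex {
    // Hypothetical helpers, named here only for illustration.

    // Without the mask the byte is widened to int with sign extension.
    static String unmasked(byte b) {
        return Integer.toHexString(b);
    }

    // With the mask only the low eight bits survive the widening.
    static String masked(byte b) {
        return Integer.toHexString(b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xD7;            // first UTF-8 byte of U+05D0, stored as -41
        System.out.println(unmasked(b)); // prints ffffffd7
        System.out.println(masked(b));   // prints d7
    }
}
```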


5 Comments

thedp Over a year ago

Thank you for the quick reply, but it still doesn't give the right values. I'm trying to reproduce the hex value 05D0, but the code gives me d7 90.

Malcolm Over a year ago

@thedp It happens because the symbols you encode are represented in UTF-8 by these bytes. If you want to receive the bytes you said, you should use UTF-16.
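A minimal sketch of the UTF-16 suggestion, with the class name Utf16Hex and the helper toHex made up for illustration. It uses StandardCharsets.UTF_16BE because Java's plain UTF-16 charset prepends a byte order mark (FE FF) when encoding. Note that this only matches the code point for characters in the Basic Multilingual Plane; supplementary characters come out as surrogate pairs.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Hex {
    // Hypothetical helper: two uppercase hex digits per byte, no separators.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u05D0\u05D1";
        // Big-endian UTF-16 without a byte order mark.
        System.out.println(toHex(a.getBytes(StandardCharsets.UTF_16BE))); // prints 05D005D1
        // The plain UTF-16 charset writes a big-endian byte order mark first.
        System.out.println(toHex(a.getBytes(StandardCharsets.UTF_16)));   // prints FEFF05D005D1
    }
}
```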

sw1nn Over a year ago

I suspect UTF-8 encoding doesn't do what you think it does. Every character outside the ASCII range is encoded as multiple bytes. See en.wikipedia.org/wiki/UTF-8#Description for details.

Malcolm Over a year ago

Exactly. D7 90 in binary is 11010111 10010000. The 110 at the start of the first byte indicates that this is the first byte of a two-byte sequence. The 10 at the start of the second byte marks it as a continuation byte rather than a first byte. If we remove these prefixes, we are left with the bits 10111 010000, which is exactly 5D0 in hex. This is how the decoding process works in UTF-8.
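The bit manipulation described in this comment can be sketched as follows. This is an illustration for two-byte sequences only, not a complete UTF-8 decoder (it does not reject overlong forms, for example); the class name Utf8TwoByteDecode and the method decodeTwoByte are hypothetical.

```java
public class Utf8TwoByteDecode {
    // Hypothetical helper: decodes one two-byte UTF-8 sequence to its code point.
    static int decodeTwoByte(byte lead, byte cont) {
        int b0 = lead & 0xFF;
        int b1 = cont & 0xFF;
        // The lead byte must look like 110xxxxx, the continuation like 10xxxxxx.
        if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        // Drop the 110 and 10 prefixes, then join the 5 + 6 payload bits.
        return ((b0 & 0x1F) << 6) | (b1 & 0x3F);
    }

    public static void main(String[] args) {
        int cp = decodeTwoByte((byte) 0xD7, (byte) 0x90);
        System.out.println(Integer.toHexString(cp)); // prints 5d0
    }
}
```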

thedp Over a year ago

Thank you for explaining this topic to me.
