I would like to convert a UTF-8 string to a hex string. In the example below I've created a sample string containing two letters. When I try to print the hex values of its UTF-8 bytes, I get negative values.

How can I make it give me 05D0 and 05D1?

String a = "\u05D0\u05D1";
byte[] xxx = a.getBytes("UTF-8");

for (byte x : xxx) {
   System.out.println(Integer.toHexString(x));
}

Thank you.

2 Answers

6

Don't convert to an encoding like UTF-8 if you want the code point. Use Character.codePointAt.

For example:

Character.codePointAt("\u05D0\u05D1", 0) // returns 1488, or 0x5d0
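To get the hex value of every letter rather than only the first, one option is to walk over all code points of the string. The following is a minimal sketch, assuming Java 8 or later; the class name CodePointHex and the helper toCodePointHex are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointHex {
    // Hypothetical helper: formats each code point of s as at least
    // four uppercase hex digits, padded with zeros on the left.
    static List<String> toCodePointHex(String s) {
        List<String> out = new ArrayList<>();
        s.codePoints().forEach(cp -> out.add(String.format("%04X", cp)));
        return out;
    }

    public static void main(String[] args) {
        // Prints [05D0, 05D1]
        System.out.println(toCodePointHex("\u05D0\u05D1"));
    }
}
```

No encoding step is involved here, so the result does not depend on UTF-8 or any other charset.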
3 Comments

Well, do you want the hex values of the UTF-8 (0xD790) or the code point (0x000005D0)? If you want the code point, convert the bytes to a string with new String(bytes, "UTF-8") and then use Character.codePointAt(...).toHexString() to get the hex representation.
Maybe I'm missing something. Character.codePointAt doesn't have a toHexString method, it returns an integer. Can you please give me complete example? Thanks
Oops, toHexString is a static method. System.out.println(Integer.toHexString(Character.codePointAt("\u05D0", 0))) will print out 5d0. If you want to pad it with zeros on the left, try System.out.printf("%08x", Character.codePointAt("\u05D0", 0)) which prints 000005d0.
3

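Negative values occur because byte is a signed type in Java, with a range of -128 to 127. When a byte such as 0xD7 is passed to Integer.toHexString, it is first widened to an int with sign extension, so the negative value is what gets printed. The following code will produce positive values: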

String a = "\u05D0\u05D1";
byte[] xxx = a.getBytes("UTF-8");

for (byte x : xxx) {
    System.out.println(Integer.toHexString(x & 0xFF));
}

The only difference is that it prints x & 0xFF instead of just x. The mask promotes the byte to an int and keeps only the low 8 bits, which drops the sign.
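To see the effect of the mask in isolation, here is a small sketch using the single byte 0xD7. The class name ByteMaskDemo and the helpers unmasked and masked are invented for this example.

```java
public class ByteMaskDemo {
    // Without a mask, the byte is sign-extended to a 32-bit int first.
    static String unmasked(byte b) {
        return Integer.toHexString(b);
    }

    // With the mask, only the low 8 bits of the widened value survive.
    static String masked(byte b) {
        return Integer.toHexString(b & 0xFF);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xD7;
        System.out.println(b);           // prints -41
        System.out.println(unmasked(b)); // prints ffffffd7
        System.out.println(masked(b));   // prints d7
    }
}
```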

5 Comments

Thank you for the quick reply, but it still doesn't give the right values. I'm trying to reproduce the Hex values of 05D0, the code gives me d7 90
@thedp That happens because those are the bytes that represent your characters in UTF-8. If you want the bytes you mentioned, you should use UTF-16.
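A rough sketch of the UTF-16 route, for comparison with UTF-8. It assumes the big-endian variant without a byte order mark (StandardCharsets.UTF_16BE); the plain "UTF-16" charset in Java prepends a byte order mark when encoding, which would add two extra bytes at the front. The class name Utf16Hex and the helper toHex are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Hex {
    // Hypothetical helper: renders each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String a = "\u05D0\u05D1";
        // UTF-16BE: two bytes per letter here, matching the code points.
        System.out.println(toHex(a.getBytes(StandardCharsets.UTF_16BE))); // prints 05D005D1
        // UTF-8: the same letters encode to different bytes.
        System.out.println(toHex(a.getBytes(StandardCharsets.UTF_8)));    // prints D790D791
    }
}
```

This only lines up with the code points for characters in the Basic Multilingual Plane; characters outside it are encoded as surrogate pairs in UTF-16.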
UTF-8 encoding doesn't do what you think it does I suspect. Each value is encoded over multiple bytes. See en.wikipedia.org/wiki/UTF-8#Description for details.
Exactly. D7 90 in binary is 11010111 10010000. The 110 at the start of the first byte is a marker saying that this is the first byte of a two-byte sequence. The 10 at the start of the second byte marks it as a continuation byte rather than a first byte. If we remove both markers, the remaining bits are 10111 010000, which is exactly 5D0 in hex. This is how the decoding process works in UTF-8.
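The bit manipulation described above can be sketched as follows. This is a simplified illustration that handles only two-byte sequences and does not reject overlong encodings; a real decoder (such as new String(bytes, "UTF-8")) does considerably more. The class name Utf8TwoByteDecode and the method decodeTwoByte are made up for this example.

```java
public class Utf8TwoByteDecode {
    // Decodes one two-byte UTF-8 sequence of the form 110xxxxx 10yyyyyy.
    // Arguments are unsigned byte values in the range 0..255.
    static int decodeTwoByte(int first, int second) {
        if ((first & 0xE0) != 0xC0 || (second & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        int high = first & 0x1F;  // strip the 110 marker, keep 5 payload bits
        int low = second & 0x3F;  // strip the 10 marker, keep 6 payload bits
        return (high << 6) | low; // join the 11 payload bits
    }

    public static void main(String[] args) {
        int codePoint = decodeTwoByte(0xD7, 0x90);
        System.out.println(Integer.toHexString(codePoint)); // prints 5d0
    }
}
```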
Thank you for explaining this topic to me.
