2

How can UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted in Java?

I have tried something like:

Character.toCodePoint((char)(Integer.parseInt("D0", 16)),(char)(Integer.parseInt("93", 16));

but it does not convert to a valid code point.

2
  • 2
    FWIW, a Java code point cannot be represented as 2 hex digits. All the code points in the Basic Multilingual Plane require 4 hex digits (0x0000 to 0xFFFF). It is not entirely correct to refer to an 8-bit UTF-8 encoding as a "Unicode Code Point". Commented Aug 31, 2015 at 0:46
  • 3
    If you came here because you have an int code point, use Character.toString. Commented Oct 19, 2021 at 14:56

2 Answers 2

4

That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).

Update

Here is a slightly more direct version of decoding the string, than provided by "sstan" in another answer. Of course both versions are good, so use whichever makes you more comfortable, or write your own version.

String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j+=3) {
    assert src.charAt(j) == '=';
    bytes[i] = (byte)(Character.digit(src.charAt(j + 1), 16) << 4 |
                      Character.digit(src.charAt(j + 2), 16));
}
String str = new String(bytes, StandardCharsets.UTF_8);

System.out.println(str);

Output

Газета
Sign up to request clarification or add additional context in comments.

Comments

1

In UTF-8, a single character is not always encoded with the same amount of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes to be encoded. Therefore, it's definitely not a trivial matter to try to map UTF-8 bytes yourself to a Java char which uses UTF-16 encoding, where each char is encoded using 2 bytes. Not to mention that, depending on the character (code point > 0xffff), you may also have to worry about dealing with surrogate characters, which is just one more complication that you can easily get wrong.

All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.

Here is some sample code that shows one way this can be achieved:

public static void main(String[] args) throws Exception {
    String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

    // Parse string into hex string tokens.
    String[] tokens = Arrays.stream(src.split("="))
            .filter(s -> s.length() != 0)
            .toArray(String[]::new);

    // Convert the hex string representations to a byte array.
    byte[] utf8bytes = new byte[tokens.length];
    for (int i = 0; i < utf8bytes.length; i++) {
        utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
    }

    // Convert UTF-8 bytes to Java String.
    String str = new String(utf8bytes, StandardCharsets.UTF_8);

    // Display string + individual unicode code points.
    System.out.println(str);
    str.codePoints().forEach(System.out::println);
}

Output:

Газета
1043
1072
1079
1077
1090
1072

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.