Java convert unicode code point to string

Question

How can a UTF-8 value like =D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0 be converted to a string in Java?

I have tried something like:

Character.toCodePoint((char)(Integer.parseInt("D0", 16)), (char)(Integer.parseInt("93", 16)));

but it does not convert to a valid code point.

FWIW, a Java code point cannot be represented as 2 hex digits. All the code points in the Basic Multilingual Plane require 4 hex digits (0x0000 to 0xFFFF). It is not entirely correct to refer to an 8-bit UTF-8 encoding as a "Unicode Code Point".
– scottb, Commented Aug 31, 2015 at 0:46
If you came here because you have an int code point, use Character.toString.
– cambunctious, Commented Oct 19, 2021 at 14:56
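
To make the two comments above concrete, here is a minimal sketch. Character.toCodePoint expects a UTF-16 high/low surrogate pair rather than two UTF-8 bytes, which is why the attempt in the question does not give a useful result. The class and method names (CodePointDemo, fromCodePoint) are made up for illustration; Character.toString(int) should also work on Java 11 and later.

```java
final class CodePointDemo {
    // Builds a String from a single Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        // U+0413 is CYRILLIC CAPITAL LETTER GHE, the first letter of the sample text.
        System.out.println(fromCodePoint(0x0413));

        // toCodePoint combines a surrogate pair; 0xD0 and 0x93 are not surrogates.
        char hi = (char) 0xD0;
        char lo = (char) 0x93;
        System.out.println(Character.isSurrogatePair(hi, lo)); // prints false
    }
}
```

This only helps once you already have a code point. For the =XX input in the question, the bytes still have to be decoded as UTF-8 first.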

Andreas · Accepted Answer · 2015-08-31 01:59:43Z

4

That string is an encoding of bytes in hex, so the best way is to decode the string into a byte[], then call new String(bytes, StandardCharsets.UTF_8).

Update

Here is a slightly more direct way to decode the string than the one provided by "sstan" in another answer. Both versions work, so use whichever you are more comfortable with, or write your own.

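import java.nio.charset.StandardCharsets;

String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

// Each byte is written as '=' followed by two hex digits, so 3 chars per byte.
assert src.length() % 3 == 0;
byte[] bytes = new byte[src.length() / 3];
for (int i = 0, j = 0; i < bytes.length; i++, j += 3) {
    assert src.charAt(j) == '=';
    // Combine the high and low hex digits into one byte.
    bytes[i] = (byte)(Character.digit(src.charAt(j + 1), 16) << 4 |
                      Character.digit(src.charAt(j + 2), 16));
}
// Let the library decode the UTF-8 bytes into a String.
String str = new String(bytes, StandardCharsets.UTF_8);

System.out.println(str);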

Output

Газета
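
The loop above assumes every byte is written as =XX. If the input might also contain literal ASCII characters between the escapes (as in quoted-printable text, which this format resembles), a variant along these lines could be used. This is only a sketch: the class name HexEscapeDecoder and the method decode are hypothetical, and it does not handle soft line breaks or other quoted-printable rules.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class HexEscapeDecoder {
    // Decodes "=XX" escapes to bytes, copies other ASCII characters through,
    // then interprets the whole byte sequence as UTF-8.
    static String decode(String src) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '=') {
                if (i + 2 >= src.length()) {
                    throw new IllegalArgumentException("Truncated escape at index " + i);
                }
                int hi = Character.digit(src.charAt(i + 1), 16);
                int lo = Character.digit(src.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("Bad hex digits at index " + i);
                }
                out.write(hi << 4 | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c < 0x80) {
                out.write(c); // assumption: unescaped characters are plain ASCII
            } else {
                throw new IllegalArgumentException("Unescaped non-ASCII at index " + i);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0"));
    }
}
```

Unlike the assert-based checks, the explicit exceptions are raised even when assertions are disabled, which is the JVM default.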

edited Aug 31, 2015 at 1:59

answered Aug 30, 2015 at 22:50

sstan · Accepted Answer · 2015-08-31 01:35:22Z

In UTF-8, a single character is not always encoded with the same number of bytes. Depending on the character, it may require 1, 2, 3, or even 4 bytes. It is therefore not trivial to map UTF-8 bytes yourself to Java char values, which use UTF-16, where each char is a 2-byte code unit. On top of that, characters with code points above 0xFFFF are represented by a surrogate pair of two char values, which is one more complication that is easy to get wrong.
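
As a rough illustration of the point about variable-length encodings, the sketch below prints how many UTF-8 bytes and how many Java char values a few sample characters occupy. The class name EncodingLengths and the sample characters are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

final class EncodingLengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",            // U+0041
            "\u0413",       // U+0413 CYRILLIC CAPITAL LETTER GHE
            "\u20AC",       // U+20AC EURO SIGN
            "\uD83D\uDE00"  // U+1F600, stored as a surrogate pair
        };
        for (String s : samples) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d char(s)%n",
                    s.codePointAt(0), utf8Length(s), s.length());
        }
    }
}
```

The four samples take 1, 2, 3, and 4 UTF-8 bytes respectively, and only the last one needs two char values.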

All this to say that Andreas is absolutely right. You should focus on parsing your string to a byte array, and then let the built-in libraries convert the UTF-8 bytes to a Java string for you. From a Java String, it's trivial to extract the Unicode code points if that's what you want.

Here is some sample code that shows one way this can be achieved:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Main {
    public static void main(String[] args) throws Exception {
        String src = "=D0=93=D0=B0=D0=B7=D0=B5=D1=82=D0=B0";

        // Parse string into hex string tokens.
        // The leading '=' produces an empty first token, which is filtered out.
        String[] tokens = Arrays.stream(src.split("="))
                .filter(s -> s.length() != 0)
                .toArray(String[]::new);

        // Convert the hex string representations to a byte array.
        byte[] utf8bytes = new byte[tokens.length];
        for (int i = 0; i < utf8bytes.length; i++) {
            utf8bytes[i] = (byte) Integer.parseInt(tokens[i], 16);
        }

        // Convert UTF-8 bytes to Java String.
        String str = new String(utf8bytes, StandardCharsets.UTF_8);

        // Display string + individual unicode code points.
        System.out.println(str);
        str.codePoints().forEach(System.out::println);
    }
}

Output:

Газета
1043
1072
1079
1077
1090
1072
