15

How can I get the Unicode value of a string in Java?

For example, if the string is "Hi", I need something like \uXXXX\uXXXX.

  • Why? What exactly are you trying to do? charAt() will help. If you want Unicode code points instead of UTF-16 code units, then codePointAt() is the more correct approach (but that won't help if you want to write \u escapes for Java source code or similar). Commented Apr 20, 2011 at 17:01
  • To simplify everything: I have an English string from a Java source file. It gets translated to Japanese. I then need the \uXXXX Unicode value, because the English string will be replaced with the Japanese one in the source file. Commented Apr 20, 2011 at 17:05
  • @user: in that case, formatting the value returned by charAt() as a 4-digit hex number and prepending \u should work. Commented Apr 20, 2011 at 17:07
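
A minimal sketch of what the last comment describes, assuming every UTF-16 code unit should be escaped, ASCII included. The class and method names are made up for illustration:

```java
final class CharEscapeSketch {
    // Hypothetical helper: writes each UTF-16 code unit of s as a
    // backslash-u escape with exactly four lowercase hex digits.
    static String toUnicodeEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toUnicodeEscapes("Hi"));
    }
}
```

For the "Hi" example from the question this prints \u0048\u0069.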

2 Answers

20

Some Unicode characters span two Java chars. To quote http://docs.oracle.com/javase/tutorial/i18n/text/unicode.html:

The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values.
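
To make the quote concrete, here is a small sketch that counts chars versus code points. U+1F600 is just an arbitrary supplementary character picked for illustration, and the class name is hypothetical:

```java
final class SupplementaryDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build the string from the code point so the source file stays pure ASCII.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(charLength(s));      // 2: a high and a low surrogate
        System.out.println(codePointLength(s)); // 1: a single character
    }
}
```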

A correct way to escape non-ASCII characters:

private static String escapeNonAscii(String str) {

  StringBuilder retStr = new StringBuilder();
  // Step through the string one code point at a time, not one char at a time.
  for (int i = 0; i < str.length(); ) {
    int cp = str.codePointAt(i);
    i += Character.charCount(cp); // 2 for supplementary characters, otherwise 1

    if (cp < 128) {
      retStr.appendCodePoint(cp);
    } else {
      // A Java escape holds exactly four hex digits (one UTF-16 code unit),
      // so a supplementary code point is written as its surrogate pair.
      for (char c : Character.toChars(cp)) {
        retStr.append(String.format("\\u%04x", (int) c));
      }
    }
  }
  return retStr.toString();
}

12

This method converts an arbitrary String to an ASCII-safe representation to be used in Java source code (or properties files, for example):

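import java.util.Formatter;

public String escapeUnicode(String input) {
  StringBuilder b = new StringBuilder(input.length());
  Formatter f = new Formatter(b); // writes formatted output straight into b
  for (char c : input.toCharArray()) {
    if (c < 128) {
      b.append(c); // plain ASCII passes through unchanged
    } else {
      // one escape per UTF-16 code unit, so a surrogate pair becomes two escapes
      f.format("\\u%04x", (int) c);
    }
  }
  return b.toString();
}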
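
Since properties files are mentioned, one way to sanity-check the escaped output is to feed it back through java.util.Properties, which decodes these escapes when loading. This is only a sketch: the class name and the "greeting" key are made up, and it assumes the value contains no backslashes, line breaks, or leading whitespace, which properties files treat specially.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class PropertiesRoundTrip {
    // Same approach as the method above: ASCII passes through,
    // every other char becomes a four-digit hex escape.
    static String escapeUnicode(String input) {
        StringBuilder b = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (c < 128) {
                b.append(c);
            } else {
                b.append(String.format("\\u%04x", (int) c));
            }
        }
        return b.toString();
    }

    // Writes the escaped value into a one-line properties document
    // and reads it back; Properties.load decodes the escapes.
    static String roundTrip(String value) {
        Properties p = new Properties();
        try {
            p.load(new StringReader("greeting=" + escapeUnicode(value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("greeting");
    }
}
```

If the escaping is right, roundTrip returns a string equal to its argument, including for characters outside the BMP.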

2 Comments

@user489041: I disagree: The right way to do this is to compile with javac -encoding UTF-8. No mess, no fuss. This is especially because 20 years on, Java still has no standard way to talk about code points by their official names. That means you are trying to insert evil and mysterious magic numbers in your code. That is not a good thing! Sure, I might rather see "\N{GREEK SMALL LETTER ALPHA}" than "α", but I SURELY do not want to see "\u03B1"! That’s just wicked. How are you going to maintain that kind of crudola?
@Martin: 1.) Strictly speaking, "Unicode" is not an n-bit character set for any value of n. 2.) Most Japanese characters fall into the Basic Multilingual Plane (the first 64k Unicode code points) and can be represented with just 4 hexadecimal digits. 3.) The Unicode escapes in Java use UTF-16, so if you have to represent anything outside the BMP, you'll have to use two \u escapes (with the correct surrogate values), which is incidentally what my code does, because a char is really a UTF-16 code unit and not a Unicode code point (those two are the same thing iff the character is in the BMP).
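
To illustrate point 3, a sketch that turns a single code point into its Java escape(s), using the surrogate helpers on java.lang.Character (available since Java 7). The class and method names are hypothetical:

```java
final class SurrogateEscapeSketch {
    // Hypothetical helper: one escape for a BMP code point,
    // two escapes (high then low surrogate) for anything above it.
    static String escapeCodePoint(int cp) {
        if (!Character.isValidCodePoint(cp)) {
            throw new IllegalArgumentException("not a Unicode code point: " + cp);
        }
        if (Character.isBmpCodePoint(cp)) {
            return String.format("\\u%04x", cp);
        }
        return String.format("\\u%04x\\u%04x",
                (int) Character.highSurrogate(cp),
                (int) Character.lowSurrogate(cp));
    }

    public static void main(String[] args) {
        System.out.println(escapeCodePoint(0x3B1));   // Greek small letter alpha, inside the BMP
        System.out.println(escapeCodePoint(0x20BB7)); // a CJK ideograph outside the BMP
    }
}
```

This prints \u03b1 for the first call and \ud842\udfb7 for the second.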
