Java String Unicode Value

Question

How can I get the Unicode value of a string in Java?

For example, if the string is "Hi", I need something like \uXXXX\uXXXX.

Why? What exactly are you trying to do? charAt() will help. If you want Unicode code points instead of UTF-16 code units, then codePointAt() is the more correct approach (but that won't help if you want to write \u escapes for Java source code or similar).
– Joachim Sauer, Commented Apr 20, 2011 at 17:01
To simplify everything: I have an English string from a Java source file. It gets translated to Japanese. I then need the \uXXXX Unicode escapes, because the English string will be replaced with the Japanese one in the source file.
– user489041, Commented Apr 20, 2011 at 17:05
@user: in that case, formatting the value returned by charAt() as a 4-digit hex number and prepending \u should work.
– Joachim Sauer, Commented Apr 20, 2011 at 17:07
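A minimal sketch of what the last comment describes: format each value returned by charAt() as a four-digit hex number and prepend \u. The class and method names (EscapeAllSketch, escapeAll) are made up for illustration. Unlike the answers below, this version escapes every char, ASCII included, which matches the \uXXXX\uXXXX form asked for in the question.

```java
class EscapeAllSketch {
    // Hypothetical helper: writes every UTF-16 code unit as a four-digit hex escape.
    static String escapeAll(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            // charAt() returns one UTF-16 code unit; %04x pads it to four hex digits.
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAll("Hi"));
    }
}
```

For the input "Hi" this should print \u0048\u0069, since 'H' is 0x48 and 'i' is 0x69.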

Raghu A · Accepted Answer · 2013-02-11 17:52:59Z

Some Unicode characters span two Java chars. Quote from http://docs.oracle.com/javase/tutorial/i18n/text/unicode.html:

The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values.
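To make the quoted definition concrete, here is a small sketch using U+1F600 as an arbitrary supplementary character. The class and constant names are hypothetical. It only relies on standard java.lang.Character and String methods.

```java
class SupplementarySketch {
    // U+1F600 lies outside the 16-bit range, so it is stored as two char values.
    static final String GRINNING = new String(Character.toChars(0x1F600));

    public static void main(String[] args) {
        // Number of char values (UTF-16 code units) in the string.
        System.out.println(GRINNING.length());
        // Number of Unicode code points in the string.
        System.out.println(GRINNING.codePointCount(0, GRINNING.length()));
        // The two chars form a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(GRINNING.charAt(0)));
        System.out.println(Character.isLowSurrogate(GRINNING.charAt(1)));
        // codePointAt() reassembles the pair into the original code point.
        System.out.println(Integer.toHexString(GRINNING.codePointAt(0)));
    }
}
```

The string has length 2 but contains a single code point, which is why a loop over chars and a loop over code points can behave differently.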

The correct way to escape non-ASCII characters:

private static String escapeNonAscii(String str) {

  StringBuilder retStr = new StringBuilder();
  for(int i=0; i<str.length(); i++) {
    int cp = Character.codePointAt(str, i);
    int charCount = Character.charCount(cp);
    if (charCount > 1) {
      i += charCount - 1; // skip the second char of a surrogate pair
      if (i >= str.length()) {
        throw new IllegalArgumentException("truncated unexpectedly");
      }
    }

    if (cp < 128) {
      retStr.appendCodePoint(cp);
    } else {
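      // A Unicode escape takes exactly four hex digits (one UTF-16 code unit),
      // so a supplementary code point is written as two escapes.
      for (char unit : Character.toChars(cp)) {
        retStr.append(String.format("\\u%04x", (int) unit));
      }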
    }
  }
  return retStr.toString();
}

Joachim Sauer · Accepted Answer · 2014-01-07 14:42:49Z

This method converts an arbitrary String to an ASCII-safe representation to be used in Java source code (or properties files, for example):

// Needs: import java.util.Formatter;
public String escapeUnicode(String input) {
  StringBuilder b = new StringBuilder(input.length());
  Formatter f = new Formatter(b);
  for (char c : input.toCharArray()) {
    if (c < 128) {
      b.append(c);
    } else {
      f.format("\\u%04x", (int) c);
    }
  }
  return b.toString();
}

edited Jan 7, 2014 at 14:42

answered Apr 20, 2011 at 17:11


2 Comments

tchrist Over a year ago

@user489041: I disagree: The right way to do this is to compile with javac -encoding UTF-8. No mess, no fuss. This is especially because 20 years on, Java still has no standard way to talk about code points by their official names. That means you are trying to insert evil and mysterious magic numbers in your code. That is not a good thing! Sure, I might rather see "\N{GREEK SMALL LETTER ALPHA}" than "α", but I SURELY do not want to see "\u03B1"! That’s just wicked. How are you going to maintain that kind of crudola?

Joachim Sauer Over a year ago

@Martin: 1.) Strictly speaking, "Unicode" is not an n-bit character set for any value of n. 2.) Most Japanese characters fall into the Basic Multilingual Plane (the first 64k Unicode code points) and can be represented with just 4 hexadecimal digits. 3.) The Unicode escapes in Java use UTF-16, so if you have to represent anything outside the BMP, you'll have to use two \u escapes (with the correct surrogate values). That is incidentally what my code does, because a char is really a UTF-16 code unit and not a Unicode code point (the two are the same thing iff the character is in the BMP).
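A short sketch of points 2 and 3 in that comment, with a hypothetical helper name (escapesFor): a BMP code point needs one escape, and anything outside the BMP needs two, one per surrogate. The two code points used here are arbitrary examples.

```java
class SurrogateEscapeSketch {
    // Hypothetical helper: the escape sequence(s) for one code point in Java source.
    static String escapesFor(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // toChars() yields one char for BMP code points and a surrogate pair otherwise.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapesFor(0x65E5));  // a BMP character (CJK ideograph)
        System.out.println(escapesFor(0x1F600)); // a supplementary character
    }
}
```

This should print \u65E5 for the first call and \uD83D\uDE00 for the second, the latter being the high and low surrogates of U+1F600.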
