How can I get the Unicode value of a string in Java?
For example, if the string is "Hi", I need something like \uXXXX\uXXXX.
Some Unicode characters span two Java chars. Quote from http://docs.oracle.com/javase/tutorial/i18n/text/unicode.html:
The characters with values that are outside of the 16-bit range, and within the range from 0x10000 to 0x10FFFF, are called supplementary characters and are defined as a pair of char values.
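To make the quoted definition concrete, here is a minimal sketch that builds one supplementary character and inspects its char values. The class name `SupplementaryDemo` and the helper `surrogatePair` are made up for this illustration; `Character.toChars` and `String.codePointCount` are standard library methods.

```java
class SupplementaryDemo {
    // Hypothetical helper: returns the UTF-16 code units of one code point
    // as space-separated 4-digit hex values.
    static String surrogatePair(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(clef.length());                         // 2
        System.out.println(clef.codePointCount(0, clef.length())); // 1
        System.out.println(surrogatePair(0x1D11E));                // D834 DD1E
    }
}
```

So a single supplementary character has a String length of 2, which is why any loop that walks a String char by char sees it as two separate values.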
A correct way to escape non-ASCII characters:
    private static String escapeNonAscii(String str) {
        StringBuilder retStr = new StringBuilder();
        for (int i = 0; i < str.length(); ) {
            int cp = str.codePointAt(i);
            if (cp < 128) {
                retStr.appendCodePoint(cp);
            } else {
                // Emit one zero-padded 4-digit escape per UTF-16 code unit, so a
                // supplementary character becomes a surrogate pair of two escapes.
                for (char c : Character.toChars(cp)) {
                    retStr.append(String.format("\\u%04x", (int) c));
                }
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return retStr.toString();
    }
This method converts an arbitrary String to an ASCII-safe representation to be used in Java source code (or properties files, for example):
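    import java.util.Formatter;

    public String escapeUnicode(String input) {
        StringBuilder b = new StringBuilder(input.length());
        Formatter f = new Formatter(b); // formats directly into b
        for (char c : input.toCharArray()) {
            if (c < 128) {
                b.append(c); // ASCII passes through unchanged
            } else {
                // One 4-digit hex escape per UTF-16 code unit
                f.format("\\u%04x", (int) c);
            }
        }
        return b.toString();
    }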
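As a rough check of the properties-file use case, the sketch below escapes a string the same way and reads it back with `java.util.Properties`, which decodes such escapes on load. The class name `PropertiesRoundTrip`, the helper `roundTrip`, and the key `greeting` are made up for this illustration. It assumes the input contains no backslashes, line breaks, or leading whitespace, since the escaping above does not handle those.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class PropertiesRoundTrip {
    // Same idea as the method above: ASCII passes through, every other
    // UTF-16 code unit becomes a 4-digit hex escape.
    static String escapeUnicode(String input) {
        StringBuilder b = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            if (c < 128) {
                b.append(c);
            } else {
                b.append(String.format("\\u%04x", (int) c));
            }
        }
        return b.toString();
    }

    // Hypothetical helper: writes "greeting=<escaped>" as properties text,
    // loads it, and returns the decoded value.
    static String roundTrip(String original) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("greeting=" + escapeUnicode(original)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("greeting");
    }

    public static void main(String[] args) {
        String original = "Gr\u00fc\u00dfe, \u03b1";
        System.out.println(escapeUnicode(original));
        System.out.println(roundTrip(original).equals(original)); // true
    }
}
```

Because the escaping works per UTF-16 code unit, a supplementary character is written as two escapes and comes back as the same surrogate pair.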
javac -encoding UTF-8. No mess, no fuss. This matters especially because, 20 years on, Java still has no standard way to talk about code points by their official names. That means you are trying to insert evil and mysterious magic numbers into your code. That is not a good thing! Sure, I might rather see "\N{GREEK SMALL LETTER ALPHA}" than "α", but I SURELY do not want to see "\u03B1"! That’s just wicked. How are you going to maintain that kind of crudola?
char is really a UTF-16 code unit and not a Unicode code point (the two are the same thing iff the character is in the BMP).
charAt() will help. If you want Unicode code points instead of UTF-16 code units, then codePointAt() is the more correct approach (but that won't help if you want to write \u escapes for Java source code or similar).
Formatting the result of charAt() as a 4-digit hex number and prepending \u should work.
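The two approaches mentioned in these comments can be sketched side by side. This is only an illustration: the class name `CharAtVsCodePointAt` and both method names are invented here, and the `U+XXXX` output of the second method is a display notation, not something Java source code accepts.

```java
class CharAtVsCodePointAt {
    // charAt() approach: one 4-digit hex escape per UTF-16 code unit.
    static String escapeWithCharAt(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // codePointAt() approach: one entry per Unicode code point,
    // written in the conventional U+ notation.
    static String listCodePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String smiley = "a" + new String(Character.toChars(0x1F600));
        System.out.println(escapeWithCharAt("Hi"));
        System.out.println(listCodePoints("Hi"));
        System.out.println(escapeWithCharAt(smiley));
        System.out.println(listCodePoints(smiley));
    }
}
```

For "Hi" the first method yields \u0048\u0069, the form asked for in the question, and the second yields U+0048 U+0069. For "a" followed by U+1F600 the first yields \u0061\ud83d\ude00 (three code units) while the second yields U+0061 U+1F600 (two code points).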