
The function that encodes a Unicode code point (an integer) to a char array (bytes) in Java is basically this:

return new char[] { (char) codePoint };

Which is just a cast from the integer value to a char.

I would like to know how this cast is actually done, i.e. the code behind that cast that makes the conversion from an integer value to a character encoded in UTF-16. I tried looking for it in the Java source code, but with no luck.

  • In Java, a char is not a "byte"-sized (8 bit) entity, but a two-byte value. Commented May 3, 2011 at 20:30

5 Answers


I'm not sure which function you're talking about.

Casting valid int code points to char will work for code points in the basic multilingual plane just due to how UTF-16 was defined. To convert anything above U+FFFF you should use Character.toChars(int) to convert to UTF-16 code units. The algorithm is defined in RFC 2781.
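To illustrate the distinction (a small sketch, not the JDK's own source): for a BMP code point the cast alone yields the single UTF-16 code unit, while for a supplementary code point `Character.toChars` emits a surrogate pair.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // BMP code point: the cast alone yields the single UTF-16 code unit
        char e = (char) 0x00E9;                    // 'é', U+00E9
        System.out.println(e);

        // Supplementary code point: Character.toChars emits a surrogate pair
        char[] pair = Character.toChars(0x1F600);  // U+1F600
        System.out.println(Integer.toHexString(pair[0])); // d83d (high surrogate)
        System.out.println(Integer.toHexString(pair[1])); // de00 (low surrogate)
    }
}
```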


6 Comments

Because of surrogate pairs, not all values of char represent valid code-points (outside of a pair) -- even if all values of char are valid numbers. E.g. it's not just "anything above 0xffff". +1 For inclusion of the conversion method (which answers the question) and link, however.
@pst - in case it is not apparent, "anything" in this case means a valid Unicode code point as defined in the spec (Unicode 4 for Java 6).
@pst - from Unicode 4, chapter 2: the lowest plane, the Basic Multilingual Plane, consists of the range 0000..FFFF. (numbers are base 16)
All numbers representable by char are valid code points, but not all are valid scalar values. (Scalar values are code points which are not surrogate code points.) So it is true that not all char values are scalar values, and not all possible char sequences are UTF-16 strings. Character.toChars, however, does not check whether the argument is a valid scalar value.
@pst - ah, I see what you meant in your deleted comment - you were quoting the algorithm: If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. I will amend the answer.

The code point is just a number that maps to a character; there's no real conversion going on. Unicode code points are specified in hexadecimal, so whatever your codePoint is in hex will map to that character (or glyph).
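For example (a minimal sketch; the characters in the comments are just sample BMP code points):

```java
public class CastDemo {
    public static void main(String[] args) {
        // The cast simply reinterprets the number as a UTF-16 code unit
        char a = (char) 0x41;        // U+0041, 'A'
        char lambda = (char) 0x3BB;  // U+03BB, Greek small letter lambda
        System.out.println(a);       // prints A
        System.out.println(lambda);
    }
}
```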

1 Comment

(Or map to a surrogate pair of char ...)

Since a char is defined to hold UTF-16 data in Java, this is all there is to it. Only if the input is an int (i.e. it can represent a Unicode codepoint of U+10000 or greater) is some calculation necessary. All char values are already UTF-16.
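That calculation can be sketched as follows. This is a hypothetical helper (`encodeUtf16` is not a JDK method) showing the arithmetic that `Character.toChars` performs, per RFC 2781, for code points at U+10000 and above:

```java
public class Utf16Encode {
    // Hypothetical helper: splits a code point into UTF-16 code units
    static char[] encodeUtf16(int codePoint) {
        if (codePoint < 0x10000) {
            // BMP: the code point value is itself the code unit
            return new char[] { (char) codePoint };
        }
        int u = codePoint - 0x10000;               // 20-bit offset
        char high = (char) (0xD800 + (u >>> 10));  // top 10 bits
        char low  = (char) (0xDC00 + (u & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encodeUtf16(0x1F600);
        System.out.println(Integer.toHexString(pair[0])); // d83d
        System.out.println(Integer.toHexString(pair[1])); // de00
    }
}
```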

1 Comment

Not necessarily true. A char is just a 16-bit value (no notion of surrogate pairs by itself). Good for pointing out the range of Unicode vs. char, though.

All chars in Java are represented internally in UTF-16. This is just mapping the integer value to that char.

1 Comment

Not necessarily true. A char is just a 16-bit value (no notion of surrogate pairs by itself). Perhaps should talk about character literals in this context?

Also, char arrays are already UTF-16, in the Java platform.
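As a small sketch of what that means in practice: a char array holding a surrogate pair is read back as a single code point by the String API.

```java
public class PairDemo {
    public static void main(String[] args) {
        char[] units = { '\uD83D', '\uDE00' };  // UTF-16 code units for U+1F600
        String s = new String(units);
        // Two chars, but one code point
        System.out.println(s.codePointCount(0, s.length()));       // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```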

2 Comments

Not necessarily true. Even though a char is 16 bits, an array of characters can hold data which is not valid UTF-16 (invalid surrogate pairs, for instance). Not all code-points fit in a char.
Right, I meant that each char of an array of chars is a UTF-16 code unit, because of skiforfun's array; however, I might not have correctly understood his question.
