
The function that encodes a Unicode code point (an integer) to a char array (bytes) in Java is basically this:

return new char[] { (char) codePoint };

Which is just a cast from the integer value to a char.

I would like to know how this cast is actually done, i.e. the code behind that cast that makes the conversion from an integer value to a character encoded in UTF-16. I tried looking for it in the Java source code, but with no luck.

  • In Java, a char is not a "byte"-sized (8 bit) entity, but a two-byte value. Commented May 3, 2011 at 20:30

5 Answers


I'm not sure which function you're talking about.

Casting valid int code points to char will work for code points in the basic multilingual plane just due to how UTF-16 was defined. To convert anything above U+FFFF you should use Character.toChars(int) to convert to UTF-16 code units. The algorithm is defined in RFC 2781.
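To illustrate the distinction (a small sketch, not the JDK's own source): for a BMP code point the cast alone yields the single UTF-16 code unit, while for a supplementary code point `Character.toChars` emits a surrogate pair.

```java
public class CodePointDemo {
    public static void main(String[] args) {
        // BMP code point: the cast alone yields the single UTF-16 code unit
        char e = (char) 0x00E9;                    // 'é', U+00E9
        System.out.println(e);

        // Supplementary code point: Character.toChars emits a surrogate pair
        char[] pair = Character.toChars(0x1F600);  // U+1F600
        System.out.println(Integer.toHexString(pair[0])); // d83d (high surrogate)
        System.out.println(Integer.toHexString(pair[1])); // de00 (low surrogate)
    }
}
```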


6 Comments

Because of surrogate pairs, not all values of char represent valid code-points (outside of a pair) -- even if all values of char are valid numbers. E.g. it's not just "anything above 0xffff". +1 For inclusion of the conversion method (which answers the question) and link, however.
@pst - in case it is not apparent, "anything" in this case means a valid Unicode code point as defined in the spec (Unicode 4 for Java 6).
@pst - from Unicode 4, chapter 2: the lowest plane, the Basic Multilingual Plane, consists of the range 0000..FFFF. (numbers are base 16)
All numbers representable by char are valid code points, but not all are valid scalar values. (Scalar values are code points which are not surrogate code points.) So it is true that not all char values are scalar values, and not all possible char sequences are UTF-16 strings. Character.toChars, however, does not check whether the argument is a valid scalar value.
@pst - ah, I see what you meant in your deleted comment - you were quoting the algorithm: If U < 0x10000, encode U as a 16-bit unsigned integer and terminate. I will amend the answer.

The code point is just a number that maps to a character; there's no real conversion going on. Unicode code points are specified in hexadecimal, so whatever your codePoint is in hex will map to that character (or glyph).
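For example (a minimal sketch; the characters in the comments are just sample BMP code points):

```java
public class CastDemo {
    public static void main(String[] args) {
        // The cast simply reinterprets the number as a UTF-16 code unit
        char a = (char) 0x41;        // U+0041, 'A'
        char lambda = (char) 0x3BB;  // U+03BB, Greek small letter lambda
        System.out.println(a);       // prints A
        System.out.println(lambda);
    }
}
```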

1 Comment

(Or map to a surrogate pair of char ...)

Since a char is defined to hold UTF-16 data in Java, this is all there is to it. Only if the input is an int (i.e. it can represent a Unicode codepoint of U+10000 or greater) is some calculation necessary. All char values are already UTF-16.
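That calculation can be sketched as follows. This is a hypothetical helper (`encodeUtf16` is not a JDK method) showing the arithmetic that `Character.toChars` performs, per RFC 2781, for code points at U+10000 and above:

```java
public class Utf16Encode {
    // Hypothetical helper: splits a code point into UTF-16 code units
    static char[] encodeUtf16(int codePoint) {
        if (codePoint < 0x10000) {
            // BMP: the code point value is itself the code unit
            return new char[] { (char) codePoint };
        }
        int u = codePoint - 0x10000;               // 20-bit offset
        char high = (char) (0xD800 + (u >>> 10));  // top 10 bits
        char low  = (char) (0xDC00 + (u & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = encodeUtf16(0x1F600);
        System.out.println(Integer.toHexString(pair[0])); // d83d
        System.out.println(Integer.toHexString(pair[1])); // de00
    }
}
```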

1 Comment

Not necessarily true. A char is just a 16-bit value (no notion of surrogate pairs by itself). Good for pointing out the range of Unicode vs. char, though.

All chars in Java are represented internally in UTF-16. This is just mapping the integer value to that char.

1 Comment

Not necessarily true. A char is just a 16-bit value (no notion of surrogate pairs by itself). Perhaps should talk about character literals in this context?

Also, char arrays are already UTF-16, in the Java platform.
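As a small sketch of what that means in practice: a char array holding a surrogate pair is read back as a single code point by the String API.

```java
public class PairDemo {
    public static void main(String[] args) {
        char[] units = { '\uD83D', '\uDE00' };  // UTF-16 code units for U+1F600
        String s = new String(units);
        // Two chars, but one code point
        System.out.println(s.codePointCount(0, s.length()));       // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1f600
    }
}
```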

2 Comments

Not necessarily true. Even though a char is 16 bits, an array of characters can hold data which is not valid UTF-16 (invalid surrogate pairs, for instance). Not all code-points fit in a char.
Right, I meant that each char of an array of chars is a UTF-16 code unit, because of skiforfun's array; however, I might not have correctly understood his question.
