What is the character encoding of String in Java?

Question

I am actually confused regarding the encoding of strings in Java. I have a couple of questions. Please help me if you know the answer to them:

1) What is the native encoding of Java strings in memory? When I write String a = "Hello" in which format will it be stored? Since Java is machine independent I don't think the system will do the encoding.

2) I read on the net that "UTF-16" is the default encoding, but I got confused because when I write int a = 'c' I get the character's number in the ASCII table. So are ASCII and UTF-16 the same?

3) Also, I wasn't sure what the in-memory storage of a string depends on: the OS, the language?

You should consider breaking these out into individual questions, as they are really very different. #2 can probably be answered here: stackoverflow.com/questions/1490218/…
– Ethel Evans, Commented Dec 15, 2010 at 18:05

Laurence Gonsalves · Accepted Answer · 2010-12-15 18:11:08Z

44

Java stores strings as UTF-16 internally.
"default encoding" isn't quite right. Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies from platform to platform, and can even be altered by things like environment variables on some platforms.
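
A minimal sketch of the difference between the in-memory representation and an external encoding. The class name ExternalEncodingDemo and the helper encodedLength are made up for illustration; Charset.defaultCharset(), String.getBytes(Charset) and StandardCharsets are standard library calls. The string always has the same number of chars in memory, while the byte count depends on the charset chosen when it leaves the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExternalEncodingDemo {
    // Number of bytes the text occupies once encoded with the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "héllo": five chars in memory on every platform

        // Platform dependent: this is the "system default encoding".
        System.out.println("default charset:  " + Charset.defaultCharset());

        System.out.println("chars in memory:  " + text.length());
        System.out.println("ISO-8859-1 bytes: " + encodedLength(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 bytes:      " + encodedLength(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16BE bytes:   " + encodedLength(text, StandardCharsets.UTF_16BE));
    }
}
```

Passing an explicit charset to getBytes, as done here, avoids depending on whatever the platform default happens to be.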

ASCII is a subset of Latin 1 which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your int i = 'x' test for any character that falls in the ASCII range you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.
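
The subset relationship can be checked with a short sketch like the one below. CharsetSubsetDemo and canEncode are illustrative names, not part of any library; the sketch relies only on CharsetEncoder.canEncode(char) and the constants in StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetSubsetDemo {
    // True if the charset can represent the character.
    static boolean canEncode(char c, Charset charset) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char ascii = 'x';       // U+0078, in ASCII
        char latin1 = '\u00e9'; // U+00E9 (e with acute), in Latin-1 but not ASCII
        char euro = '\u20ac';   // U+20AC (euro sign), in neither ASCII nor Latin-1

        // The int value is the Unicode code; for ASCII characters it equals the ASCII value.
        System.out.println((int) ascii);  // 120
        System.out.println((int) latin1); // 233
        System.out.println((int) euro);   // 8364

        for (char c : new char[] {ascii, latin1, euro}) {
            System.out.println(Integer.toHexString(c)
                    + " ASCII=" + canEncode(c, StandardCharsets.US_ASCII)
                    + " Latin-1=" + canEncode(c, StandardCharsets.ISO_8859_1)
                    + " UTF-16=" + canEncode(c, StandardCharsets.UTF_16BE));
        }
    }
}
```
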
From the java.lang.Character docs:

The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.

So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.

edited Dec 15, 2010 at 18:11

answered Dec 15, 2010 at 18:04

Laurence Gonsalves


5 Comments

jarnbjo Over a year ago

The usage of char and char arrays is only defined for the public, external API for String and StringBuffer. The internal storage of the characters is implementation specific.

Laurence Gonsalves Over a year ago

@jarnbjo The above is a direct quote from the docs. The char datatype in Java represents a UTF-16 code unit (not a character, aka Unicode codepoint) so I think it's pretty safe to say that Java the language's representation of text is UTF-16. Yes, conceivably an implementation could choose to do something different under the covers, but in the end they'd have to make it look just like they were using UTF-16.

jarnbjo Over a year ago

Since there is no way to access the internal storage of the String and StringBuffer classes, it makes no sense to assume that the statement you quote applies to it.

Hendy Irawan Over a year ago

UTF-16BE or UTF-16LE ?

Laurence Gonsalves Over a year ago

@HendyIrawan Java doesn't let you access the individual bytes, only the chars (which correspond to UTF-16 code units), so there is no set endianness. The actual byte order used in memory is JVM/platform dependent, just like the byte order used to store an int in memory.
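
Byte order only becomes visible once a string is encoded to bytes. A small sketch, with Utf16ByteOrderDemo and hex as illustrative names; the charsets are the standard ones from StandardCharsets:

```java
import java.nio.charset.StandardCharsets;

class Utf16ByteOrderDemo {
    // Formats bytes as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A"; // a single UTF-16 code unit, 0x0041

        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 41 00
        // The plain UTF-16 charset encodes big-endian and prepends a byte-order mark.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
    }
}
```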

towi · Accepted Answer · 2017-06-12 08:59:31Z

21

1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.

2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99.

3) Since each String is an object, it contains other information than its class members (e.g., class descriptor word, lock/semaphore word, etc.).
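
Point 2) can be tried directly. A minimal sketch, where CharToIntDemo and codeUnitOf are names invented for the example:

```java
class CharToIntDemo {
    // Assigning a char to an int is an implicit widening conversion.
    static int codeUnitOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        int a = 'c';
        System.out.println(a);                      // 99
        System.out.println(Integer.toHexString(a)); // 63

        // The conversion is reversible with an explicit cast.
        System.out.println((char) 99);              // c

        // char is an unsigned 16-bit type, so its largest value is 0xFFFF.
        System.out.println((int) Character.MAX_VALUE); // 65535
    }
}
```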

ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).

EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).

ADDENDUM 2
The assumption that one char represents one Unicode character holds only in most cases. It would imply UCS-2 encoding, and it was true before 2005. Unicode has since grown, and Strings have to be encoded using UTF-16, where a single Unicode character may take two chars in a Java String.
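
A short sketch of the two-chars-per-character case, using U+1D11E (MUSICAL SYMBOL G CLEF) as an arbitrary example of a character outside the 16-bit range. SurrogatePairDemo and codePoints are illustrative names; the Character and String methods used are standard.

```java
class SurrogatePairDemo {
    // Number of Unicode characters (code points), as opposed to chars.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Build a string from a single code point above U+FFFF.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());     // 2 (UTF-16 code units)
        System.out.println(codePoints(clef));  // 1 (Unicode character)

        // The two chars form a surrogate pair.
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true

        // codePointAt reassembles the pair into the original code point.
        System.out.println(Integer.toHexString(clef.codePointAt(0))); // 1d11e
    }
}
```

Code that iterates with charAt sees the two halves separately; String.codePoints() or codePointAt are the usual way to walk whole characters.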

Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html

edited Jun 12, 2017 at 8:59

towi


answered Dec 15, 2010 at 18:09

David R Tribble


4 Comments

user506710 Over a year ago

What do you actually intend to say in part 3)? It contains other information, so...?

Hawkeye Parker Over a year ago

"Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent." What's a little confusing here is that Unicode coincides with Extended ASCII (8-bit) for the first 256 characters; Extended ASCII, in turn, corresponds directly with 7-bit ASCII for the first 128. So 'c' is encoded as 0x63 in Unicode, Extended ASCII, and ASCII. This is why you'd see the int for 'c' and think it's ASCII (which it sort of is :).

David R Tribble Over a year ago

@HawkeyeParker: Yes, 7-bit ASCII (ISO 646) and 8-bit ISO 8859-1 (Latin-1) are proper subsets of Unicode. That being said, Java encodes all character values as 16-bit Unicode.

Hawkeye Parker Over a year ago

Absolutely. I was just clarifying for those who might be confused by the overlap.

Ralph · Accepted Answer · 2010-12-15 22:50:00Z

7

While this doesn't answer your question, it is worth noting that in Java bytecode (the class file), strings are stored in UTF-8 (strictly speaking, a modified form of UTF-8). http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
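
The class file itself is awkward to inspect from a short example, so the sketch below uses DataOutputStream.writeUTF as a stand-in. That method is documented to write a modified UTF-8; treating its output as representative of what the class file stores is an assumption of this sketch. ModifiedUtf8Demo and modifiedUtf8 are illustrative names.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encodes with writeUTF and drops the 2-byte length prefix it emits first.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // For plain ASCII text the two encodings agree: one byte per character.
        System.out.println(modifiedUtf8("Hello").length);                      // 5
        System.out.println("Hello".getBytes(StandardCharsets.UTF_8).length);   // 5

        // They differ for the NUL character: two bytes (C0 80) versus one.
        System.out.println(modifiedUtf8("\0").length);                         // 2
        System.out.println("\0".getBytes(StandardCharsets.UTF_8).length);      // 1
    }
}
```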

edited Dec 15, 2010 at 22:50

answered Dec 15, 2010 at 18:04

Ralph


7 Comments

Ralph Over a year ago

@Loadmaster I believe it is useful information, and I explicitly mentioned that it is the class file, so what's your problem?

Sergei Tachenov Over a year ago

But it doesn't answer the question. You could post it as a comment and begin with something like "While this doesn't answer your question, it is worth noting that..." This is indeed a useful piece of information, though, I had no idea they used UTF-8. What's the point? It means that JVM has to convert every string to UTF-16 on startup.

David R Tribble Over a year ago

@Sergey Tachenov: Strings are stored as UTF-8 so that .class files are smaller (on average).

Sergei Tachenov Over a year ago

This doesn't matter at all when you put them in a JAR file which you usually do. UTF-16 will be compressed almost twice as efficiently.

Ralph Over a year ago

@parsecer: Oracle's documentation is quite strict about this: "encoding : Set the source file encoding name, such as EUC-JP and UTF-8". So this is only the source file (*.java) encoding; the encoding of Strings in *.class files stays UTF-8.


LaGrandMere · Accepted Answer · 2010-12-15 18:29:05Z

2

Edit: thanks to LoadMaster for helping me correct my answer :)

1) All internal String processing is done in UTF-16.

2) ASCII is a subset of Unicode, so ASCII characters keep the same code values in UTF-16.

3) Internally, Java uses UTF-16. For the rest, it depends on the platform you are on, yes.

edited Dec 15, 2010 at 18:29

answered Dec 15, 2010 at 18:07

LaGrandMere


3 Comments

David R Tribble Over a year ago

Strings are stored internally (in memory) as char[], each element containing a 16-bit UTF-16 Unicode character. UTF-8 is not used to store strings internally, but is used for converting I/O streams to/from strings.

LaGrandMere Over a year ago

@LoadMaster: Has this changed over time? Has Java always used UTF-16 internally?

David R Tribble Over a year ago

Yes, String has always used an internal char[] to store its character values.
