I am confused about how strings are encoded in Java. I have a few questions; please help if you know the answers:

1) What is the native encoding of Java strings in memory? When I write String a = "Hello", in which format is it stored? Since Java is machine independent, I don't think the system does the encoding.

2) I read on the net that "UTF-16" is the default encoding, but I got confused because when I write int a = 'c' I get the number of the character in the ASCII table. So are ASCII and UTF-16 the same?

3) Also, I wasn't sure what the in-memory storage of a string depends on: the OS, the language?

  • You should consider breaking these out into individual questions, as they are really very different. #2 can probably be answered here: stackoverflow.com/questions/1490218/… Commented Dec 15, 2010 at 18:05

4 Answers

  1. Java stores strings as UTF-16 internally.

  2. "default encoding" isn't quite right. Java stores strings as UTF-16 internally, but the encoding used externally, the "system default encoding", varies from platform to platform, and can even be altered by things like environment variables on some platforms.

    ASCII is a subset of Latin 1 which is a subset of Unicode. UTF-16 is a way of encoding Unicode. So if you perform your int i = 'x' test for any character that falls in the ASCII range you'll get the ASCII value. UTF-16 can represent a lot more characters than ASCII, however.

  3. From the java.lang.Character docs:

    The Java 2 platform uses the UTF-16 representation in char arrays and in the String and StringBuffer classes.

    So it's defined as part of the Java 2 platform that UTF-16 is used for these classes.
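To make the points above concrete, here is a minimal sketch. The class name EncodingDemo and its helper methods are made up for illustration; only the java.nio.charset calls are real API. It shows that a char's numeric value matches ASCII inside the ASCII range, and that the number of bytes you get externally depends on the encoding you pick, not on how the string is held in memory.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class EncodingDemo {
    // A char is a UTF-16 code unit; widening it to int yields its numeric value.
    static int codeUnit(char c) {
        return c;
    }

    // Number of bytes produced when the string is encoded for external use.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        System.out.println(codeUnit('x'));       // 120, same as the ASCII value
        System.out.println(codeUnit('\u00e9'));  // 233, outside ASCII but inside Latin-1
        System.out.println(codeUnit('\u20ac'));  // 8364, outside Latin-1

        // The "system default encoding"; the value printed varies by platform.
        System.out.println(Charset.defaultCharset());

        String s = "h\u00e9llo";
        System.out.println(encodedLength(s, StandardCharsets.ISO_8859_1)); // 5
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));      // 6
        System.out.println(encodedLength(s, StandardCharsets.UTF_16BE));   // 10
    }
}
```

Passing an explicit Charset to getBytes avoids depending on whatever the platform default happens to be.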

5 Comments

The usage of char and char arrays is only defined for the public, external API for String and StringBuffer. The internal storage of the characters is implementation specific.
@jarnbjo The above is a direct quote from the docs. The char datatype in Java represents a UTF-16 code unit (not a character, aka Unicode codepoint) so I think it's pretty safe to say that Java the language's representation of text is UTF-16. Yes, conceivably an implementation could choose to do something different under the covers, but in the end they'd have to make it look just like they were using UTF-16.
Since there is no way to access the internal storage of the String and StringBuffer classes, it makes no sense to assume that the statement you quote applies to it.
UTF-16BE or UTF-16LE ?
@HendyIrawan Java doesn't let you access the individual bytes, only the chars (which correspond to UTF-16 code units), so there is no set endianness. The actual byte order used in memory is JVM/platform dependent, just like the byte order used to store an int in memory.

1) Strings are objects, which typically contain a char array and the string's length. The character array is usually implemented as a contiguous array of 16-bit words, each one containing a Unicode character in native byte order.

2) Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent. Thus 'c', which is U+0063, becomes 0x0063, or 99.

3) Since each String is an object, it contains other information than its class members (e.g., class descriptor word, lock/semaphore word, etc.).
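Point 2 can be checked directly. This is only a small sketch; the class CharToInt and its methods are names I made up for the example.

```java
// Hypothetical demo class, not part of any library.
class CharToInt {
    // Implicit widening conversion: the 16-bit char value becomes an int.
    static int toInt(char c) {
        return c;
    }

    // Formats the value in the usual U+XXXX notation.
    static String label(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        int a = 'c';
        System.out.println(a);               // 99
        System.out.println(label('c'));      // U+0063
        System.out.println((char) (a + 1));  // d (narrowing back to char needs a cast)
        System.out.println("Hello".toCharArray().length); // 5 chars
    }
}
```

Going from char to int needs no cast, while going back from int to char does, because an int can hold values that do not fit in 16 bits.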

ADDENDUM
The object contents depend on the JVM implementation (which determines the inherent overhead associated with each object), and how the class is actually coded (i.e., some libraries may be more efficient than others).

EXAMPLE
A typical implementation will allocate an overhead of two words per object instance (for the class descriptor/pointer, and a semaphore/lock control word); a String object also contains an int length and a char[] array reference. The actual character contents of the string are stored in a second object, the char[] array, which in turn is allocated two words, plus an array length word, plus as many 16-bit char elements as needed for the string (plus any extra chars that were left hanging around when the string was created).

ADDENDUM 2
The claim that one char represents one Unicode character holds only in most cases. It would imply UCS-2 encoding, and it was true before 2005. Unicode has since grown, and Strings have to be encoded using UTF-16, where a single Unicode character may take two chars in a Java String.
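A quick way to see the two-chars-per-character case is to build a string from a code point above U+FFFF. The sketch below uses U+1F600 as an arbitrary example; the class and constant names are invented for illustration.

```java
// Hypothetical demo class, not part of any library.
class SurrogateDemo {
    // One supplementary character (U+1F600), built from its code point.
    static final String GRINNING_FACE = new String(Character.toChars(0x1F600));

    // Number of UTF-16 code units (Java chars) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(charCount("c"));                 // 1
        System.out.println(charCount(GRINNING_FACE));       // 2, a surrogate pair
        System.out.println(codePointCount(GRINNING_FACE));  // 1
        System.out.println(Character.isHighSurrogate(GRINNING_FACE.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(GRINNING_FACE.charAt(1)));  // true
        System.out.println(Integer.toHexString(GRINNING_FACE.codePointAt(0)));  // 1f600
    }
}
```

So length() counts chars, not characters; use codePointCount or the codePoints() stream when you need to work with whole Unicode characters.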

Take a look at the actual source code for Apache's implementation, e.g. at:
http://www.docjar.com/html/api/java/lang/String.java.html

4 Comments

Actually what do you intend to say in your 3) part. It contains other information so .... ??
"Assigning a character value to an integer converts the 16-bit Unicode character code into its integer equivalent." What's a little confusing here is that the Unicode encoding coincides with ASCII for the first 256 characters. Unicode correlates with Extended ASCII (8-bit) for the first 256 characters; Extended ASCII, in turn, corresponds directly with 7-bit ASCII for the first 128. So 'c' is encoded as 0x63 in Unicode, Extended ASCII, and ASCII. This is why you'd see the int for 'c' and think it's ASCII (which it sort of is :).
@HawkeyeParker: Yes, 7-bit ASCII (ISO 646) and 8-bit ISO 8859-1 (Latin-1) are proper subsets of Unicode. That being said, Java encodes all character values as 16-bit Unicode.
absolutely. I was just clarifying for those who might be confused by the overlap.

While this doesn't answer your question, it is worth noting that in Java bytecode (the class file), strings are stored in UTF-8. http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html
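For a rough feel of the size difference, here is a sketch that does not read a real class file. It assumes that DataOutputStream.writeUTF, which writes a 2-byte length followed by modified UTF-8, is a fair stand-in for how a string constant's bytes are encoded in a class file. The class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class ClassFileStringSize {
    // Bytes needed for the string in modified UTF-8,
    // excluding the 2-byte length prefix that writeUTF adds.
    static int modifiedUtf8Length(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream with a short string.
            throw new UncheckedIOException(e);
        }
        return buffer.size() - 2;
    }

    // Bytes needed for the same string as UTF-16 without a byte order mark.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        System.out.println(modifiedUtf8Length("Hello")); // 5
        System.out.println(utf16Length("Hello"));        // 10
    }
}
```

For text that is mostly ASCII, as identifiers and string literals in source code often are, the UTF-8 form takes about half the space of UTF-16.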

7 Comments

@Loadmaster I believe it is useful information, and I explicitly mentioned that it is the class file, so what's your problem?
But it doesn't answer the question. You could post it as a comment and begin with something like "While this doesn't answer your question, it is worth noting that..." This is indeed a useful piece of information, though, I had no idea they used UTF-8. What's the point? It means that JVM has to convert every string to UTF-16 on startup.
@Sergey Tachenov: Strings are stored as UTF-8 so that .class files are smaller (on average).
This doesn't matter at all when you put them in a JAR file which you usually do. UTF-16 will be compressed almost twice as efficiently.
@parsecer: Oracle's documentation is quite strict about this: "encoding : Set the source file encoding name, such as EUC-JP and UTF-8". So this is only the source file (*.java) encoding; the encoding of Strings in *.class files stays UTF-8.

Edit: thanks to LoadMaster for helping me correct my answer :)

1) All internal String processing is done in UTF-16.

2) The ASCII character set is a subset of Unicode, which UTF-16 encodes, so ASCII characters keep the same code values.

3) Internally, Java uses UTF-16. Beyond that, yes, it depends on the platform you are on.

3 Comments

Strings are stored internally (in memory) as char[], each element containing a 16-bit UTF-16 Unicode character. UTF-8 is not used to store strings internally, but is used for converting I/O streams to/from strings.
@LoadMaster: has it changed over time? Has Java always used UTF-16 internally?
Yes, String has always used an internal char[] to store its character values.
