UTF-8 in XML constructed using Java

Question

I am constructing some XML in Java. Some nodes may have values in Korean or another language. After constructing it, how do I make sure that the whole XML is UTF-8 encoded? Do I need to explicitly convert the string to UTF-8 with something like:

string = new String(s.getBytes(), "UTF-8");

Or will the whole string be automatically in UTF-8?

Also, if I get some XML containing something like <name>[B@19821f</name>, how do I know that [B@19821f is the UTF-8 of some Korean word?

Kozlov, remember to accept some answers to your other questions. Surely somebody has helped you out with an answer to some of those questions.
– Wivani, commented Aug 26, 2011 at 9:54

JB Nizet · Accepted Answer · 2011-08-26 08:35:41Z

1

A string contains characters. The encoding is irrelevant until you transform the string into bytes. This happens when you call String.getBytes(), or when you write the String to a stream (file, socket, whatever).
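A minimal sketch of that point, assuming nothing about the asker's XML API: the same three Korean characters occupy a different number of bytes depending on the charset passed to String.getBytes(Charset). It also shows that turning a byte[] into text without decoding it yields the array's default toString(), which is probably where a value like [B@19821f comes from. The class and method names (EncodingDemo, encodedLength, arrayAsText, decodeUtf8) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Number of bytes the same characters occupy in a given encoding.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    // What you get when a byte[] is turned into text without decoding it:
    // the array's default toString(), "[B@" followed by a hash code.
    static String arrayAsText(byte[] bytes) {
        return bytes.toString();
    }

    // Decoding the bytes with the charset they were encoded in recovers the text.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three Hangul syllables, written as escapes so the source file
        // encoding does not affect the example.
        String korean = "\uD55C\uAD6D\uC5B4";

        System.out.println(encodedLength(korean, StandardCharsets.UTF_8));    // 9
        System.out.println(encodedLength(korean, StandardCharsets.UTF_16BE)); // 6

        byte[] utf8 = korean.getBytes(StandardCharsets.UTF_8);
        System.out.println(arrayAsText(utf8));               // "[B@" plus a hash, not the text
        System.out.println(decodeUtf8(utf8).equals(korean)); // true
    }
}
```

A [B@... value carries no character data at all, so it cannot be turned back into the Korean word; the fix has to happen where the byte[] was put into the XML.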

Make sure you use an OutputStreamWriter to write your XML string, and that you specify UTF-8 as the charset when constructing it. If you're using a dedicated marshalling API like JAXB, set the appropriate property so that UTF-8 is used and the generated XML declares its encoding (in the <?xml ...?> header). Without knowing which API you're using to generate your XML string, it's hard to be more helpful.
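As a rough sketch of the OutputStreamWriter advice, assuming the XML already exists as a String and is written to an in-memory stream: the charset is fixed when the writer is constructed, and the XML declaration names the same encoding. The class name Utf8XmlWriter and the method toUtf8Bytes are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8XmlWriter {
    // Encodes the XML text as UTF-8 at the point where characters become bytes.
    static byte[] toUtf8Bytes(String xml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(xml);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The declared encoding must match the charset given to the writer.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<name>\uD55C\uAD6D\uC5B4</name>";
        byte[] bytes = toUtf8Bytes(xml);

        // Decoding with the same charset gives back the original text.
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(xml)); // true
    }
}
```

With JAXB, the property in question is Marshaller.JAXB_ENCODING; whether it applies depends on which API actually produces the XML.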

answered Aug 26, 2011 at 8:35

JB Nizet


1 Comment

Kozlov Over a year ago

Hi, thanks a lot. I was using:

    ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
    PrintWriter pr = new PrintWriter(byteOut);

I think this will solve it:

    ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
    Writer writer = null;
    try {
        writer = new OutputStreamWriter(byteOut, "UTF-8");
    } catch (UnsupportedEncodingException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    PrintWriter pr = new PrintWriter(writer);

Am I right?

Joachim Sauer · Accepted Answer · 2011-08-26 08:34:51Z

1

First: the code you posted to "change the string to UTF8" is wrong. You never want to use that (*).

If you parse XML (and the XML is correctly encoded), you'll already get String values in Java that hold the correctly decoded characters, so there is nothing else you need to do. Just handle the String objects as you normally would.
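A small sketch of the parsing case, assuming the JDK's built-in DOM parser (the question does not say which parser is in use): the parser reads the encoding from the XML declaration and hands back an ordinary String. ParseDemo and readRootText are made-up names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class ParseDemo {
    // Parses raw XML bytes and returns the text of the root element.
    // The parser decodes the bytes itself, using the declared encoding.
    static String readRootText(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAD6D\uC5B4";
        byte[] xmlBytes = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?><name>" + korean + "</name>")
                .getBytes(StandardCharsets.UTF_8);

        // No manual conversion is needed on the parsed value.
        System.out.println(readRootText(xmlBytes).equals(korean)); // true
    }
}
```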

(*) There are a few cases where you have to "undo" damage already done where this might be useful, but those cases are very rare and then it will usually not work correctly either.
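To illustrate why the conversion from the question does not help, here is a sketch that replaces the platform default charset of s.getBytes() with an explicit one so the outcome is reproducible. When the bytes are produced as UTF-8 the round trip changes nothing; when they are produced in a charset that cannot represent the characters, the text is already lost before decoding. RoundTripDemo and reencode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // The pattern from the question: encode, then decode the bytes as UTF-8.
    // encodeWith stands in for the platform default used by s.getBytes().
    static String reencode(String s, Charset encodeWith) {
        return new String(s.getBytes(encodeWith), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAD6D\uC5B4";

        // Default charset happens to be UTF-8: the result equals the input.
        System.out.println(reencode(korean, StandardCharsets.UTF_8).equals(korean)); // true

        // Default charset cannot represent Hangul: each character is replaced
        // by '?' during encoding, and decoding cannot bring it back.
        System.out.println(reencode(korean, StandardCharsets.ISO_8859_1)); // ???
    }
}
```

Either way the String is no more "UTF-8" afterwards than it was before.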

answered Aug 26, 2011 at 8:34

Joachim Sauer


2 Comments

tchrist Over a year ago

Great answer! Do you have any idea why we see this sort of inappropriate manual encoding/decoding so often? Is it from being fuzzy on the notions of encoders/decoders bound to streams or of abstract Unicode characters, or what? The only times I’ve had to actually do this were on files erroneously containing mixed encodings that varied by line, or when dealing with raw databases. You see the wrong-headed manual approach more often in Python than you do in Java or Perl, but I have seen it crop up in all three, and I wonder why users keep attempting it.

Joachim Sauer Over a year ago

As far as I know (and my perception might be wrong) the difference is that in Python (pre-3, can't talk about Perl) a string is actually more of a byte array than anything else (pretty much the same way it is in C): you have to remember to use the correct encoding with it and so on. In Java, however, the encoding of String is fixed and unchangeable. This means that you can treat it as if it were "pure Unicode" (which of course isn't true any more, since it's UTF-16 now): you only ever need to specify one encoding for the conversion; the other is implicit.
