
I am constructing some XML in Java. Some nodes in the XML may have values in Korean or some other language. After constructing it, how do I make sure that my whole XML is in UTF-8 encoding? Do I need to explicitly change the string to UTF-8 by using something like:

string = new String(s.getBytes(), "UTF-8");

Or will the whole string be automatically in UTF-8?

Also, if I get some XML containing something like <name>[B@19821f</name>, how do I know that [B@19821f is the UTF-8 form of some Korean word?

1
  • Kozlov, remember to accept some answers to your other questions. Surely somebody has helped you out with an answer to some of those questions .. Commented Aug 26, 2011 at 9:54

2 Answers

1

A string contains characters. The encoding is irrelevant until you transform the string into bytes. This happens when you call String.getBytes(), or when you write the String to a stream (file, socket, whatever).
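To illustrate the point about characters versus bytes, here is a minimal sketch. The class name EncodingDemo and its helper methods are made up for this example; only String.getBytes(Charset) and the String(byte[], Charset) constructor come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // The charset only matters here, when characters become bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decoding must use the same charset that produced the bytes.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAD6D\uC5B4"; // the word for "Korean language"
        byte[] utf8 = toUtf8(korean);
        System.out.println(korean.length());               // 3 characters
        System.out.println(utf8.length);                   // 9 bytes in UTF-8
        System.out.println(fromUtf8(utf8).equals(korean)); // true
    }
}
```

The String itself never "is in UTF-8"; it only has characters. The nine bytes exist only after the explicit encoding step.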

Make sure you use an OutputStreamWriter to write your XML string, and that you specify UTF-8 as the charset when constructing this OutputStreamWriter. If you're using a dedicated marshalling API like JAXB, set the appropriate property so that the UTF-8 encoding is used and the generated XML declares its encoding (in the <?xml ...?> header). Without knowing which API you're using to generate your XML string, it's hard to be more helpful.
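As a rough sketch of the OutputStreamWriter advice, assuming the XML has already been built as a String: the class XmlWriteDemo and the method writeUtf8 are hypothetical names, and a ByteArrayOutputStream stands in for whatever file or socket stream you actually write to.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class XmlWriteDemo {
    // Encodes the XML text as UTF-8 while writing it to a byte stream.
    static byte[] writeUtf8(String xml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The charset is given explicitly, so the platform default is never used.
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(xml);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // The declared encoding should match the charset used by the writer.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<name>\uD55C\uAD6D\uC5B4</name>";
        byte[] bytes = writeUtf8(xml);
        System.out.println(bytes.length + " bytes written as UTF-8");
    }
}
```

The important part is that the encoding in the XML declaration and the charset passed to the writer agree.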


1 Comment

Hi, thanks a lot. I was using:

ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
PrintWriter pr = new PrintWriter(byteOut);

I think this will solve it:

ByteArrayOutputStream byteOut = new ByteArrayOutputStream();
Writer writer = null;
try {
    writer = new OutputStreamWriter(byteOut, "UTF-8");
} catch (UnsupportedEncodingException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
PrintWriter pr = new PrintWriter(writer);

Am I right?
1

First: the code you posted to "change the string to UTF8" is wrong. You never want to use that (*).

If you parse XML (and the XML is correctly encoded), you'll already get String values in Java with the correctly decoded characters, so there is nothing else you need to do: just handle the String objects as you normally would.
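A small sketch of that, assuming the JDK's built-in DOM parser is used (the question does not say which parser is involved). XmlParseDemo and readName are hypothetical names for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XmlParseDemo {
    // Parses raw XML bytes and returns the text of the root element.
    // The parser reads the encoding from the XML declaration and decodes for us.
    static String readName(byte[] xmlBytes) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<name>\uD55C\uAD6D\uC5B4</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String name = readName(bytes);
        // No manual re-encoding needed: the String already holds the Korean text.
        System.out.println(name.equals("\uD55C\uAD6D\uC5B4")); // true
    }
}
```

Note that the parser is handed bytes (an InputStream), not a String, so it can apply the declared encoding itself.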

(*) there are a few cases where you have to "undo" damage already done where this might be useful, but those cases are very rare and then it will usually not work correctly either.
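To make the problem with new String(s.getBytes(), "UTF-8") concrete, here is a hedged sketch. s.getBytes() with no argument uses the platform default charset; below, ISO-8859-1 is used as a stand-in for a non-UTF-8 default, which is an assumption made for the demonstration. The class and method names are invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WrongConversionDemo {
    // Mimics new String(s.getBytes(), "UTF-8") on a machine whose
    // default charset is the given one.
    static String reinterpret(String s, Charset platformDefault) {
        byte[] bytes = s.getBytes(platformDefault);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String korean = "\uD55C\uAD6D\uC5B4";

        // Default is UTF-8: encode and decode cancel out, nothing is gained.
        System.out.println(reinterpret(korean, StandardCharsets.UTF_8).equals(korean)); // true

        // Default is ISO-8859-1: Korean characters cannot be encoded,
        // each becomes '?', and the original text is lost.
        System.out.println(reinterpret(korean, StandardCharsets.ISO_8859_1)); // prints ???
    }
}
```

So the conversion is either a no-op or destructive, depending on the platform default; it never "makes the string UTF-8".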

2 Comments

Great answer! Do you have any ideas why we see this sort of inappropriate manual encoding/decoding so often? Is it from being fuzzy on the notions of encoders/decoders bound to streams or of abstract Unicode characters, or what? The only times I've had to actually do this were on files erroneously containing mixed encoding that varied by line, or when dealing with raw databases. You see the wrong-headed manual approach more often in Python than you do in Java or Perl, but I have seen it crop up in all three, and I wonder why users keep attempting it.
As far as I know (and my perception might be wrong) the difference is that in Python (pre-3, can't talk about Perl) a string is actually more of a byte array than anything else (pretty much the same way it is in C): you have to remember to use the correct encoding with it and so on. In Java, however, the encoding of String is fixed and unchangeable. This means that you can treat it as if it were "pure Unicode" (which of course isn't true any more, since it's UTF-16 now): you only ever need to specify one encoding for the conversion; the other is implicit.
