
Could anyone help me understand the difference between the two string conversions below? I am trying to encode a string as UTF-8. Which one is the correct method?

String string2 = new String(string1.getBytes("UTF-8"), "UTF-8");

OR

String string3 = new String(string1.getBytes(), "UTF-8");

Also, if I use both statements together, i.e.

line 1: string1 = new String(string1.getBytes("UTF-8"), "UTF-8");
line 2: string1 = new String(string1.getBytes(), "UTF-8");

Will the value of string1 be the same after both lines?

PS: The purpose of all this is to send Japanese text in a web service call, so I want to send it with UTF-8 encoding.

  • A String holds Unicode text, covering all possible scripts. If you get the bytes for a more restricted encoding/Charset, a lossy conversion happens, normally resulting in <?> or question marks (?). UTF-8 is a full Unicode charset. Commented Nov 1, 2022 at 12:45

3 Answers

3

According to the javadoc of String#getBytes(String charsetName):

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

And the documentation of String(byte[] bytes, Charset charset):

Constructs a new String by decoding the specified array of bytes using the specified charset.

Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both to preserve the original string value. That is, your second example is wrong:

// This is wrong because getBytes() is called with the default charset,
// but those bytes are then decoded as UTF-8. This will mostly work
// because the default encoding is usually UTF-8, but it can fail,
// so it is wrong.
new String(string1.getBytes(), "UTF-8");
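To make the failure visible without depending on the machine's settings, here is a minimal sketch. It assumes ISO-8859-1 as a stand-in for a non-UTF-8 default charset, and the class and method names are made up for illustration. The Japanese text is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetRoundTrip {

    // Encode with one charset, then decode with another (hypothetical helper).
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes.
        String original = "\u65E5\u672C\u8A9E";

        // Same charset on both sides: the text survives.
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(original.equals(same)); // prints true

        // ISO-8859-1 stands in for a non-UTF-8 default charset (an assumption).
        // It cannot represent these characters, so getBytes() substitutes '?'.
        String mixed = roundTrip(original, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(original.equals(mixed)); // prints false
    }
}
```

Once the characters have been replaced during encoding, no later decoding step can bring them back.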

6 Comments

Got your point. But in the second case, I am calling getBytes() with the default charset on the same string that was produced with UTF-8 in line 1. So ideally line 2 should also preserve the string, giving the same value I got from line 1. Is that right?
I know this is the wrong way to write the code, since the first line was sufficient. I am working on legacy code that has both lines as shown in my question. Unless these two lines can cause a major issue at the receiver's end, I don't want to change it.
No, the second line can still fail to behave as expected if the default charset is not UTF-8. In the second line, string1.getBytes() encodes string1 using the default charset, which may or may not be UTF-8. When you then create a new string from those bytes using UTF-8, it will fail if the default charset is not UTF-8, because in that case the bytes are not UTF-8-encoded.
Thanks. Is there a better way to do it? Or should I just keep line 1 and remove line 2 in the current code?
My string contains Japanese characters.
3

String and char (two-byte UTF-16 code units) in Java are for (Unicode) text.

When converting from and to byte[]s one needs the Charset (encoding) of those bytes.
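A small sketch of what that means in practice: the same three-character text has a different byte length under each charset. The class and helper names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLengths {

    // Number of bytes the text occupies in the given charset (hypothetical helper).
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the
        // source file's own encoding does not matter.
        String text = "\u65E5\u672C\u8A9E";

        System.out.println(text.length());                                  // prints 3 (UTF-16 chars)
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));    // prints 9
        System.out.println(encodedLength(text, StandardCharsets.UTF_16BE)); // prints 6
    }
}
```

A byte[] on its own does not say which of these encodings produced it, so the charset has to travel with it or be agreed on in advance.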

Both String.getBytes() and new String(byte[]) are shortcuts that use the platform's default charset. That is almost always wrong for cross-platform use.

So use

byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");

Or better, without having to handle the checked UnsupportedEncodingException:

byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);

(On Android, StandardCharsets is only available from API level 19 onwards.)

The same holds for InputStreamReader, OutputStreamWriter that bridge binary data (InputStream/OutputStream) and text (Reader, Writer).
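For the stream case, a minimal sketch could look like the following. It uses in-memory streams so it runs without any files, and the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class StreamCharsets {

    // Text -> bytes through a Writer with an explicit charset.
    static byte[] write(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Bytes -> text through a Reader with the same explicit charset.
    static String read(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[256];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u65E5\u672C\u8A9E"; // three Japanese characters
        System.out.println(text.equals(read(write(text)))); // prints true
    }
}
```

Leaving out the charset argument in either constructor falls back to the default charset, which is the same trap as the no-argument getBytes().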


-1

Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.

Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].

There are no UTF-8-encoded strings in Java.

If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.

If it doesn't take a string, then pass it string1.getBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.
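As a rough sketch of that idea, assuming the JDK's built-in java.net.http client (Java 11+) rather than whatever library the question actually uses: the body is serialized as UTF-8 and the Content-Type header tells the recipient so. The URL, the text/plain media type, and the class and method names are placeholders for illustration; the request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

final class Utf8RequestSketch {

    // Builds (but does not send) a POST whose body is the text encoded as UTF-8.
    static HttpRequest buildRequest(String text) {
        return HttpRequest.newBuilder(URI.create("https://example.invalid/api")) // placeholder URL
                // Declares the encoding of the body to the recipient.
                .header("Content-Type", "text/plain; charset=UTF-8")
                // The String is encoded to bytes here, with an explicit charset.
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("\u65E5\u672C\u8A9E"); // three Japanese characters
        System.out.println(request.headers().firstValue("Content-Type").orElse("(none)"));
        System.out.println(request.bodyPublisher().get().contentLength()); // prints 9
    }
}
```

The string itself is never "UTF-8 encoded"; the encoding only happens at the point where the client turns it into bytes for the wire.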

