String encoding (UTF-8) JAVA

Question

Could anyone please help me out here? I want to know the difference between the two conversions below. I am trying to encode a string as UTF-8. Which one is the correct method?

String string2 = new String(string1.getBytes("UTF-8"), "UTF-8");

OR

String string3 = new String(string1.getBytes(), "UTF-8");

Also, if I use both statements together, i.e.

line 1: string1 = new String(string1.getBytes("UTF-8"), "UTF-8");
line 2: string1 = new String(string1.getBytes(), "UTF-8");

Will the value of string1 be the same after both lines?

PS: The purpose of all this is to send Japanese text in a web service call, so I want to send it with UTF-8 encoding.

A String holds Unicode text, in any script. If you get the bytes for some more restricted encoding/Charset, a lossy conversion happens, normally resulting in <?> or question marks (?). UTF-8 is a full Unicode charset.
– Joop Eggen, Commented Nov 1, 2022 at 12:45

Aqeel Ashiq · Accepted Answer · 2018-03-28 14:38:47Z

3

According to the javadoc of String#getBytes(String charsetName):

Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.

And the documentation of String(byte[] bytes, Charset charset):

Constructs a new String by decoding the specified array of bytes using the specified charset.

Thus getBytes() is the opposite operation of String(byte[]). getBytes() encodes the string to bytes, and String(byte[]) decodes the byte array back into a string. You have to use the same charset for both to preserve the actual string value. That is why your second example is wrong:

// This is wrong because getBytes() encodes with the default charset,
// but the bytes are then decoded as UTF-8. This often works because the
// default charset is frequently UTF-8, but it can fail when it is not,
// so it is wrong.
new String(string1.getBytes(), "UTF-8");
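To see the failure without changing the machine's settings, here is a minimal sketch that simulates a non-UTF-8 default by passing a charset explicitly. The class and method names are made up for illustration, and ISO-8859-1 merely stands in for whatever default a given platform might have.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {

    // Encode and decode with the same charset: the text survives unchanged.
    static String sameCharsetRoundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Encode with one charset, then decode as UTF-8. The encodeWith parameter
    // plays the role of the default charset used by the no-argument getBytes().
    static String mismatchedRoundTrip(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three Japanese characters, written as escapes so the source file's
        // own encoding cannot interfere with the demonstration.
        String japanese = "\u65E5\u672C\u8A9E";

        // Same charset on both sides: prints true.
        System.out.println(japanese.equals(sameCharsetRoundTrip(japanese)));

        // "Default" happens to be UTF-8: prints true.
        System.out.println(japanese.equals(
                mismatchedRoundTrip(japanese, StandardCharsets.UTF_8)));

        // "Default" is ISO-8859-1: the characters cannot be encoded and are
        // replaced with '?', so this prints false.
        System.out.println(japanese.equals(
                mismatchedRoundTrip(japanese, StandardCharsets.ISO_8859_1)));
    }
}
```

Applied to the two lines in the question: line 1 is a lossless round trip and leaves string1 unchanged, while line 2 only preserves it when the default charset is UTF-8 (which is the case on JDK 18 and later unless overridden, but not guaranteed on older runtimes).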



6 Comments

no_name22 Over a year ago

Got your point. But in the second case I am taking the default bytes of the same string that was already round-tripped through UTF-8 in line 1. So ideally line 2 should also preserve the string, giving the same value I got from line 1. Is that right?

no_name22 Over a year ago

I know this is the wrong way to write the code, as the first line was sufficient. I am working on legacy code that has both lines, as mentioned in my question. Unless these two lines can cause a major issue at the receiver's end, I don't want to change it.

Aqeel Ashiq Over a year ago

No, the second line can still misbehave if the default charset is not UTF-8. In the second line, string1.getBytes() encodes string1 using the default charset, which may or may not be UTF-8. When the same line then creates a new string by decoding those bytes as UTF-8, the result is wrong if the default charset is not UTF-8, because in that case the bytes are not UTF-8-encoded.

no_name22 Over a year ago

Thanks. Is there a better way to do it? Or should I just keep line 1 and remove line 2 in the current code?

no_name22 Over a year ago

My string contains Japanese characters.


Joop Eggen · Accepted Answer · 2018-03-28 14:41:18Z

String and char (two-byte UTF-16 code units) in Java are for (Unicode) text.

When converting from and to byte[]s one needs the Charset (encoding) of those bytes.

Both String.getBytes() and new String(byte[]) are shortcuts that use the platform's default charset. That is almost always wrong for cross-platform use.

So use

byte[] b = s.getBytes("UTF-8");
s = new String(b, "UTF-8");

Or better, without having to handle a checked UnsupportedEncodingException:

byte[] b = s.getBytes(StandardCharsets.UTF_8);
s = new String(b, StandardCharsets.UTF_8);

(On Android, StandardCharsets is only available from API level 19 onward.)

The same holds for InputStreamReader and OutputStreamWriter, which bridge binary data (InputStream/OutputStream) and text (Reader/Writer).
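For the stream case, a small sketch along these lines passes the charset explicitly to both bridge classes. The class and method names are hypothetical, and in-memory streams are used only so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf8StreamDemo {

    // Text -> bytes: the writer encodes with the charset given here,
    // not with the platform default.
    static byte[] writeUtf8(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    // Bytes -> text: the reader decodes with the same explicit charset.
    static String readUtf8(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String japanese = "\u65E5\u672C\u8A9E";
        byte[] encoded = writeUtf8(japanese);
        // The decoded text equals the original because both sides agree on UTF-8.
        System.out.println(japanese.equals(readUtf8(encoded)));
    }
}
```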

Tom Blodget · Accepted Answer · 2018-03-28 17:06:05Z

-1

Please don't confuse yourself. "String" is usually used to refer to values in a datatype that stores text. In this case, java.lang.String.

Serialized text is a sequence of bytes created by applying a character encoding to a string. In this case, byte[].

There are no UTF-8-encoded strings in Java.
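A quick way to convince yourself of this is to compare the string with its encoded forms. The sketch below uses hypothetical names and shows that the same String yields different byte sequences depending on the charset you ask for, while the String itself carries no encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StringVsBytesDemo {

    // Number of UTF-16 code units in the string, independent of any charset.
    static int charCount(String text) {
        return text.length();
    }

    // Number of bytes produced when the string is serialized with a charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // three characters

        System.out.println(charCount(japanese));                                // 3
        System.out.println(encodedLength(japanese, StandardCharsets.UTF_8));    // 9
        System.out.println(encodedLength(japanese, StandardCharsets.UTF_16BE)); // 6
    }
}
```

The encoding only comes into existence at the point where the text is turned into bytes, which is why "encoding a String to UTF-8" and storing the result back in a String does nothing useful.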

If your web service client library takes a string, pass it the string. If it lets you specify an encoding to use for serialization, pass it StandardCharsets.UTF_8 or equivalent.

If it doesn't take a string, then pass it string1.getBytes(StandardCharsets.UTF_8) and use whatever other mechanism it provides to let you tell the recipient that the bytes are UTF-8-encoded text. Or, get a different client library.
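As one possible illustration, assuming the web service is plain HTTP and you are on Java 11 or later, the JDK's own java.net.http client lets you state the charset both when serializing the body and in the Content-Type header. The endpoint URL, the text/plain media type, and the class and method names below are placeholders, not anything from the question, and the request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

class Utf8RequestSketch {

    // Builds (but does not send) a POST whose body is the text serialized as
    // UTF-8 and whose Content-Type header tells the recipient so.
    static HttpRequest buildRequest(String endpoint, String text) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "text/plain; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical endpoint, used only to construct the request object.
        HttpRequest request = buildRequest(
                "https://example.invalid/service", "\u65E5\u672C\u8A9E");

        System.out.println(request.headers().firstValue("Content-Type").orElse("(none)"));
        // Length of the serialized body in bytes, not in characters.
        System.out.println(request.bodyPublisher().get().contentLength());
    }
}
```

A real SOAP or REST client library will have its own way of setting the charset, so treat this only as a picture of the two places where UTF-8 has to be named.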

