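In Java, a String is a sequence of UTF-16 code units.
The loss of information does not come from UTF-16 and UTF-8 covering different characters: both can encode every Unicode code point. It comes from the fact that not every byte sequence is valid UTF-8. When you decode arbitrary bytes as UTF-8, each malformed sequence is replaced with the replacement character U+FFFD and the original bytes are gone. So, when you convert back to UTF-8, you will not get the same byte array.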
public static void tryCharsetEncodingForRandomBytes() {
    // getRandomBytes(n) is a helper that is not shown here; it is assumed to return n random bytes.
    byte[] initialBytes = getRandomBytes(64);
    // Malformed sequences are silently replaced with U+FFFD during decoding.
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);
    byte[] finalBytes = initialString.getBytes(StandardCharsets.UTF_8);
    String finalString = new String(finalBytes, StandardCharsets.UTF_8);
    System.out.println(finalString.equals(initialString));
    System.out.println(initialBytes.length);
    System.out.println(finalBytes.length);
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}
Output (finalBytes.length varies from run to run):
true
64
103
false
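To see where the extra bytes come from, here is a minimal sketch with a single malformed byte instead of random input. The class name ReplacementCharDemo and the helper roundTrip are made up for this illustration. It assumes the default behaviour of new String(byte[], Charset), which replaces malformed input with U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ReplacementCharDemo {

    // Decodes the bytes as UTF-8 and encodes the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8, so it is malformed input.
        byte[] initialBytes = { 'a', (byte) 0xFF, 'b' };
        String decoded = new String(initialBytes, StandardCharsets.UTF_8);

        // The malformed byte has been replaced by the replacement character.
        System.out.println(decoded.charAt(1) == '\uFFFD'); // true

        // The replacement character is encoded as the three bytes EF BF BD.
        byte[] finalBytes = roundTrip(initialBytes);
        System.out.println(initialBytes.length);                     // 3
        System.out.println(finalBytes.length);                       // 5
        System.out.println(Arrays.equals(initialBytes, finalBytes)); // false
    }
}
```

One malformed byte turns into three bytes after the round trip, which is why the final array can be longer than the initial one.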
You will not encounter this loss of information when the bytes are valid UTF-8 to begin with, for example when they come from encoding alphanumeric text.
public static void tryCharsetEncodingForAlphanumeric() {
    String alphaNumeric = "abcd1234";
    byte[] initialBytes = alphaNumeric.getBytes(StandardCharsets.UTF_8);
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);
    byte[] finalBytes = initialString.getBytes(StandardCharsets.UTF_8);
    String finalString = new String(finalBytes, StandardCharsets.UTF_8);
    System.out.println(finalString.equals(initialString));
    System.out.println(initialBytes.length);
    System.out.println(finalBytes.length);
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}
Output:
true
8
8
true
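If you want to know in advance whether a byte array will survive the round trip, one option is to decode it strictly instead of letting malformed input be replaced. The following is only a sketch: the class name Utf8Validity and the method isValidUtf8 are hypothetical, while CharsetDecoder and CodingErrorAction.REPORT are the standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Validity {

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Throw on bad input instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Bytes produced by encoding text are always valid.
        System.out.println(isValidUtf8("abcd1234".getBytes(StandardCharsets.UTF_8))); // true
        // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
        System.out.println(isValidUtf8(new byte[] { (byte) 0xC3, (byte) 0x28 }));     // false
    }
}
```

Random bytes will usually fail this check, while anything produced by getBytes(StandardCharsets.UTF_8) passes it.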
This means that your tests will pass as long as the bytes you start from are valid UTF-8, which is always the case when they were produced by encoding a String.
public static void yourTestScenarioWithAlphaNumeric() {
    String alphaNumeric = "abcdefghijklmop1234567890";
    byte[] initialBytes = alphaNumeric.getBytes(StandardCharsets.UTF_8);
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);
    // Prefix the payload and separate the two parts with a tab.
    String concatenatedString = String.join("\t", "Pre", initialString);
    byte[] concatenatedStringToBytes = concatenatedString.getBytes(StandardCharsets.UTF_8);
    String concatenatedBytesBackToString = new String(concatenatedStringToBytes, StandardCharsets.UTF_8);
    // Everything after the first tab is the original payload.
    int indexOfDelimiter = concatenatedBytesBackToString.indexOf("\t");
    String finalString = concatenatedBytesBackToString.substring(indexOfDelimiter + 1);
    byte[] finalBytes = finalString.getBytes(StandardCharsets.UTF_8);
    System.out.println(finalString.equals(initialString));
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}
Output:
true
true
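If the payload really is arbitrary binary data rather than text, the same tab-delimited scenario can be made lossless by turning the bytes into text with Base64 instead of decoding them as UTF-8. This is only a sketch of one possible approach, not something taken from the original test; the class name Base64RoundTrip and the helpers pack and unpack are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64RoundTrip {

    // Joins the prefix and the Base64 text of the payload with a tab.
    static String pack(String prefix, byte[] payload) {
        return String.join("\t", prefix, Base64.getEncoder().encodeToString(payload));
    }

    // Takes everything after the first tab and decodes it back to bytes.
    static byte[] unpack(String packed) {
        int indexOfDelimiter = packed.indexOf("\t");
        return Base64.getDecoder().decode(packed.substring(indexOfDelimiter + 1));
    }

    public static void main(String[] args) {
        // Not valid UTF-8, and it even contains a tab byte (0x09).
        byte[] initialBytes = { (byte) 0xFF, 0x00, (byte) 0xC3, 0x28, 0x09 };

        String concatenatedString = pack("Pre", initialBytes);
        // The packed text is plain ASCII, so the UTF-8 round trip is safe.
        byte[] concatenatedStringToBytes = concatenatedString.getBytes(StandardCharsets.UTF_8);
        String concatenatedBytesBackToString = new String(concatenatedStringToBytes, StandardCharsets.UTF_8);

        byte[] finalBytes = unpack(concatenatedBytesBackToString);
        System.out.println(Arrays.equals(initialBytes, finalBytes)); // true
    }
}
```

The Base64 alphabet contains no tab character, so the delimiter cannot collide with the payload, at the cost of roughly a third more bytes.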
String, or they represent a String which has more than one possible UTF-8 representation. Arrays.equals(initialBytes, finalBytes) being false a very possible and valid result?