
I have a byte array. I need to prepend a string and a delimiter to it, and later recover the original byte array. After this round trip, the output byte array is not equal to the input. In Java, the following test fails at the last line:

    @Test
    void test1() {
        byte[] initialBytes = RandomUtils.nextBytes(64);
        String initialString = new String(initialBytes, StandardCharsets.UTF_8);

        String concatenatedString = String.join("\t", "Pre", initialString);
        byte[] concatenatedStringToBytes = concatenatedString.getBytes(StandardCharsets.UTF_8);

        String concatenatedBytesBackToString = new String(concatenatedStringToBytes, StandardCharsets.UTF_8);

        int indexOfDelimeter = concatenatedBytesBackToString.indexOf("\t");
        String finalString = concatenatedBytesBackToString.substring(indexOfDelimeter + 1);

        byte[] finalBytes = finalString.getBytes(StandardCharsets.UTF_8);

        assertEquals(initialString, finalString);
        assertTrue(Arrays.equals(initialBytes, finalBytes));
    }
  • Two possibilities. Either your randomly generated bytes don't represent a valid String, or they represent a String which has more than one possible UTF-8 representation. Commented Jul 21, 2021 at 3:27
  • Isn't Arrays.equals(initialBytes, finalBytes) being false a very possible and valid result? Commented Jul 21, 2021 at 4:05
  • @sarveshseri - It is, but the strings corresponding to those byte arrays are equal. So I was confused why the byte[] are unequal. Commented Jul 21, 2021 at 4:27
  • Change it to dump both the byte arrays when they are unequal and have a look at them. Commented Jul 21, 2021 at 4:34

1 Answer

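In Java, String values are stored internally as UTF-16.

UTF-8 and UTF-16 can both encode every Unicode code point, so converting valid text between them loses nothing. The loss happens one step earlier: random bytes are almost never valid UTF-8, and new String(bytes, StandardCharsets.UTF_8) silently replaces each malformed sequence with the replacement character U+FFFD. When you encode the string back to UTF-8, every U+FFFD becomes the three bytes EF BF BD, so you do not get the same byte array, and its length usually grows.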
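To see where the bytes actually change, here is a minimal sketch that decodes a hand-picked invalid input. It assumes a single 0xFF byte (which never occurs in well-formed UTF-8) followed by the letter a; the class name ReplacementDemo and the helper roundTrip are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementDemo {
    // Decodes the bytes as UTF-8, then encodes the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF is not a legal byte anywhere in UTF-8; 'a' is plain ASCII.
        byte[] invalid = {(byte) 0xFF, 'a'};

        String decoded = new String(invalid, StandardCharsets.UTF_8);
        // The malformed byte was decoded as the replacement character U+FFFD.
        System.out.println(decoded.charAt(0) == '\uFFFD');

        // U+FFFD encodes to EF BF BD, so two input bytes come back as four.
        System.out.println(Arrays.toString(roundTrip(invalid)));
    }
}
```

The decoded string compares equal to itself after any number of further round trips, which is why the string assertion passes while the byte assertion fails.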

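public static void tryCharsetEncodingForRandomBytes() {
    // Random bytes are very unlikely to form a valid UTF-8 sequence.
    byte[] initialBytes = new byte[64];
    new java.util.Random().nextBytes(initialBytes);
    // Malformed sequences are replaced with U+FFFD during decoding.
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);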

    byte[] finalBytes = initialString.getBytes(StandardCharsets.UTF_8);
    String finalString = new String(finalBytes, StandardCharsets.UTF_8);

    System.out.println(finalString.equals(initialString));
    System.out.println(initialBytes.length);
    System.out.println(finalBytes.length);
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}

Output (the length of finalBytes varies from run to run):

true
64
103
false
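If you would rather detect such input than have it silently altered, one option is a strict decoder. The sketch below assumes you only need a yes/no answer; the class name Utf8Check and the method isValidUtf8 are hypothetical names, while CharsetDecoder and CodingErrorAction.REPORT are part of java.nio.charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the bytes decode as UTF-8 without any malformed input.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of substituting U+FFFD.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("abcd1234".getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(new byte[]{(byte) 0xFF}));
    }
}
```

Only byte arrays for which this check returns true can be expected to survive a decode and re-encode unchanged.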

You will not encounter this loss of information when the bytes are valid UTF-8 to begin with, for example when they were produced by encoding alphanumeric text.

public static void tryCharsetEncodingForAlphanumeric() {
    String alphaNumeric = "abcd1234";

    byte[] initialBytes = alphaNumeric.getBytes(StandardCharsets.UTF_8);
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);

    byte[] finalBytes = initialString.getBytes(StandardCharsets.UTF_8);
    String finalString = new String(finalBytes, StandardCharsets.UTF_8);

    System.out.println(finalString.equals(initialString));
    System.out.println(initialBytes.length);
    System.out.println(finalBytes.length);
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}

Output:

true
8
8
true

This means your test will pass as long as the initial bytes are valid UTF-8, that is, as long as they were produced by encoding an actual string.

public static void yourTestScenarioWithAlphaNumeric() {
    String alphaNumeric = "abcdefghijklmop1234567890";

    byte[] initialBytes = alphaNumeric.getBytes(StandardCharsets.UTF_8);
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);

    String concatenatedString = String.join("\t", "Pre", initialString);
    byte[] concatenatedStringToBytes = concatenatedString.getBytes(StandardCharsets.UTF_8);

    String concatenatedBytesBackToString = new String(concatenatedStringToBytes, StandardCharsets.UTF_8);

    int indexOfDelimiter = concatenatedBytesBackToString.indexOf("\t");
    String finalString = concatenatedBytesBackToString.substring(indexOfDelimiter + 1);

    byte[] finalBytes = finalString.getBytes(StandardCharsets.UTF_8);

    System.out.println(finalString.equals(initialString));
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}

Output:

true
true
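If the payload really is arbitrary binary data, one possible workaround is to turn it into text with a lossless encoding such as Base64 before joining it with the prefix. This is a sketch under assumptions, not necessarily what the question needs: the class name PrefixedPayload and the methods pack and unpack are invented here, and the prefix is assumed not to contain a tab.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class PrefixedPayload {
    // Joins prefix and payload with a tab; the payload travels as Base64 text.
    static byte[] pack(String prefix, byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(payload);
        return String.join("\t", prefix, encoded).getBytes(StandardCharsets.UTF_8);
    }

    // Recovers the payload: everything after the first tab, Base64-decoded.
    static byte[] unpack(byte[] packed) {
        String text = new String(packed, StandardCharsets.UTF_8);
        String encoded = text.substring(text.indexOf('\t') + 1);
        return Base64.getDecoder().decode(encoded);
    }

    public static void main(String[] args) {
        // Bytes that are not valid UTF-8 on their own.
        byte[] initialBytes = {(byte) 0xFF, (byte) 0x80, 0x00, 0x09, 0x41};
        byte[] finalBytes = unpack(pack("Pre", initialBytes));
        System.out.println(Arrays.equals(initialBytes, finalBytes));
    }
}
```

The Base64 alphabet contains no tab character, so the first tab in the packed text is always the delimiter, even when the payload itself contains the byte 0x09.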