Byte Array, when converted to string, then concatenated, returns equal String but unequal byte array

Question

I have a byte array. I need to prepend a string and a delimiter to it, and later recover the original byte array. After this round trip, the output byte array is not equal to the input, even though the strings are equal.

The following Java test fails at the last line:

    @Test
    void test1() {
        byte[] initialBytes = RandomUtils.nextBytes(64);
        String initialString = new String(initialBytes, StandardCharsets.UTF_8);

        String concatenatedString = String.join("\t", "Pre", initialString);
        byte[] concatenatedStringToBytes = concatenatedString.getBytes(StandardCharsets.UTF_8);

        String concatenatedBytesBackToString = new String(concatenatedStringToBytes, StandardCharsets.UTF_8);

        int indexOfDelimiter = concatenatedBytesBackToString.indexOf("\t");
        String finalString = concatenatedBytesBackToString.substring(indexOfDelimiter + 1);

        byte[] finalBytes = finalString.getBytes(StandardCharsets.UTF_8);

        assertEquals(initialString, finalString);
        assertTrue(Arrays.equals(initialBytes, finalBytes));
    }

Two possibilities. Either your randomly generated bytes don't represent a valid String, or they represent a String which has more than one possible UTF-8 representation.
– Dawood ibn Kareem, Commented Jul 21, 2021 at 3:27
Isn't Arrays.equals(initialBytes, finalBytes) being false a very possible and valid result?
– sarveshseri, Commented Jul 21, 2021 at 4:05
@sarveshseri - It is, but the strings corresponding to those byte arrays are equal. So I was confused why the byte arrays are unequal.
– user1430186, Commented Jul 21, 2021 at 4:27
Change it to dump both the byte arrays when they are unequal and have a look at them.
– tgdavies, Commented Jul 21, 2021 at 4:34

sarveshseri · Accepted Answer · 2021-07-21 04:35:22Z

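In Java, String values are sequences of UTF-16 code units.

UTF-8 and UTF-16 can both encode every Unicode code point, so the mismatch does not come from a difference in character coverage. It comes from the input: random bytes are usually not well-formed UTF-8. When new String(bytes, StandardCharsets.UTF_8) meets a malformed sequence, it substitutes the replacement character U+FFFD, so the original bytes are already lost at that first decoding step. Encoding the string back to UTF-8 writes each U+FFFD as the three bytes EF BF BD, which gives a different and usually longer byte array. The strings still compare equal because both of them already contain the replacement characters.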
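A minimal sketch of where the information is lost, assuming the cause is malformed UTF-8 input being replaced with U+FFFD during decoding. The class and method names are made up for illustration; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementCharDemo {
    // Decode the bytes as UTF-8, then encode the resulting String back to UTF-8.
    public static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {(byte) 0xFF};
        String decoded = new String(invalid, StandardCharsets.UTF_8);

        // The malformed byte was replaced by U+FFFD while decoding.
        System.out.println(decoded.equals("\uFFFD"));            // true
        // Re-encoding yields EF BF BD, not the original FF.
        System.out.println(Arrays.toString(roundTrip(invalid))); // [-17, -65, -67]
    }
}
```

One input byte becomes three output bytes here, which is the same effect that makes a 64-byte random array grow after the round trip.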

public static void tryCharsetEncodingForRandomBytes() {
    byte[] initialBytes = getRandomBytes(64); // helper not shown; assumed to return 64 random bytes
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);

    byte[] finalBytes = initialString.getBytes(StandardCharsets.UTF_8);
    String finalString = new String(finalBytes, StandardCharsets.UTF_8);

    System.out.println(finalString.equals(initialString));
    System.out.println(initialBytes.length);
    System.out.println(finalBytes.length);
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}

Output (one sample run; the final length varies with the random bytes):

true
64
103
false
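To check up front whether a given byte array can survive such a round trip, one option is a strict decoder that reports malformed input instead of silently replacing it. The following is a sketch under that assumption; Utf8Validity and isValidUtf8 are hypothetical names, while CharsetDecoder and CodingErrorAction are standard JDK classes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Validity {
    // Returns true only if the bytes are well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of inserting U+FFFD when the input is malformed.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] text = "abcd1234".getBytes(StandardCharsets.UTF_8);
        byte[] broken = {(byte) 0xFF};

        System.out.println(isValidUtf8(text));   // true
        System.out.println(isValidUtf8(broken)); // false
    }
}
```

If this check returns false for the initial bytes, decoding them with new String(..., UTF_8) will not be reversible.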

You will not encounter this loss of information when the bytes are well-formed UTF-8 to begin with, for example when they were produced by encoding an existing String such as an alphanumeric one.

public static void tryCharsetEncodingForAlphanumeric() {
    String alphaNumeric = "abcd1234";

    byte[] initialBytes = alphaNumeric.getBytes(StandardCharsets.UTF_8);
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);

    byte[] finalBytes = initialString.getBytes(StandardCharsets.UTF_8);
    String finalString = new String(finalBytes, StandardCharsets.UTF_8);

    System.out.println(finalString.equals(initialString));
    System.out.println(initialBytes.length);
    System.out.println(finalBytes.length);
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}

Output:

true
8
8
true

This means that your test will pass as long as the initial bytes are well-formed UTF-8.

public static void yourTestScenarioWithAlphaNumeric() {
    String alphaNumeric = "abcdefghijklmop1234567890";

    byte[] initialBytes = alphaNumeric.getBytes(StandardCharsets.UTF_8);
    String initialString = new String(initialBytes, StandardCharsets.UTF_8);

    String concatenatedString = String.join("\t", "Pre", initialString);
    byte[] concatenatedStringToBytes = concatenatedString.getBytes(StandardCharsets.UTF_8);

    String concatenatedBytesBackToString = new String(concatenatedStringToBytes, StandardCharsets.UTF_8);

    int indexOfDelimiter = concatenatedBytesBackToString.indexOf("\t");
    String finalString = concatenatedBytesBackToString.substring(indexOfDelimiter + 1);

    byte[] finalBytes = finalString.getBytes(StandardCharsets.UTF_8);

    System.out.println(finalString.equals(initialString));
    System.out.println(Arrays.equals(initialBytes, finalBytes));
}

Output:

true
true
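If the payload really is arbitrary binary data, one common workaround is to turn it into text with a lossless encoding such as Base64 before joining it with the prefix. The sketch below assumes that changing the wire format is acceptable and that the prefix itself contains no tab; LosslessConcat, join, and extractPayload are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class LosslessConcat {
    // Join a text prefix and a binary payload; the payload is Base64-encoded,
    // so it can never contain the tab delimiter or malformed UTF-8.
    public static String join(String prefix, byte[] payload) {
        String encoded = Base64.getEncoder().encodeToString(payload);
        return String.join("\t", prefix, encoded);
    }

    // Recover the original payload bytes from the joined string.
    public static byte[] extractPayload(String joined) {
        int indexOfDelimiter = joined.indexOf('\t');
        return Base64.getDecoder().decode(joined.substring(indexOfDelimiter + 1));
    }

    public static void main(String[] args) {
        byte[] initialBytes = {(byte) 0xFF, 0x09, 0x00, (byte) 0x80, 0x41};

        // Same round trip as in the question: String -> UTF-8 bytes -> String.
        byte[] wire = join("Pre", initialBytes).getBytes(StandardCharsets.UTF_8);
        String received = new String(wire, StandardCharsets.UTF_8);

        byte[] finalBytes = extractPayload(received);
        System.out.println(Arrays.equals(initialBytes, finalBytes)); // true
    }
}
```

The cost is roughly a one-third increase in size. An alternative that avoids strings entirely is to concatenate at the byte level, for example with a ByteArrayOutputStream, and split on the first delimiter byte.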
