
I am trying to convert a CharSequence to a UTF-8 encoded byte[] array.

I've been having problems with it, so I was going to ask Stack Overflow for help, and I started writing a Java fiddle to demonstrate it:

https://www.mycompiler.io/view/3MliN0HgwDD

Except the fiddle itself doesn't work:

import java.util.*;
import java.lang.*;
import java.io.*;
import java.nio.*;
import java.nio.charset;

// The main method must be in a class named "Main".
class Main {
    public static byte[] charSequenceToUtf8(final CharSequence input)
    {
        //char[] chars = new char[input.length];
        //for (int i=0; i<input.length; i++)
        //  chars[i] = input.charAt(i);

        CharBuffer charBuffer = CharBuffer.wrap(input);
        checkEquals(10, charBuffer.length(), "Charbuffer is wrong length");

        Charset cs = Charset.forName("UTF-8"); 
        ByteBuffer byteBuffer = cs.encode(charBuffer);
        checkEquals(10, byteBuffer.length(), "byteBuffer is wrong length");
        
        byte[] utf8 = byteBuffer.array();        
        checkEquals(10, utf8.length, "utf8 bytes is wrong length");
    }
    
    public static void checkEquals(int expected, int actual, String message)
    {
        if (expected == actual)
            return;
            
        String sExpected = String.valueOf(expected);
        String sActual = String.valueOf(actual);
        
        throw new Exception("Test failed. Expected "+sExpected+", Actual "+sActual+". "+message);
    }
    
    public static void main(String[] args) {
        test("AAAAAAAAAA"); //ten A's
    }
}

It seems that java.nio requires at least Java 7, which is why it confuses me that the fiddle doesn't work in Java 16.

So this brings up a couple of questions:

  • How can I convert a CharSequence to a byte[] array?
  • Why does it not work in Java 16?

In the end, the actual bug is that encoding the string AAAAAAAAAA (ten A's) returns an 11-element array:

CharSequence | UTF-8 byte array
"AA"         | [65, 65]
"AAA"        | [65, 65, 65]
"AAAA"       | [65, 65, 65, 65]
"AAAAA"      | [65, 65, 65, 65, 65]
"AAAAAA"     | [65, 65, 65, 65, 65, 65]
"AAAAAAA"    | [65, 65, 65, 65, 65, 65, 65]
"AAAAAAAA"   | [65, 65, 65, 65, 65, 65, 65, 65]
"AAAAAAAAA"  | [65, 65, 65, 65, 65, 65, 65, 65, 65]
"AAAAAAAAAA" | [65, 65, 65, 65, 65, 65, 65, 65, 65, 65, 0]

Why is the above code, which I stole from the linked question, failing for a string of 10 characters?

  • Looks like it's working to me. Don't know why, but it appears encoding "AAAAAAAAAA" returns a ByteBuffer whose capacity is 11, but its limit is set to 10. You're printing out the entire backing array without taking the limit into account. Commented Jul 8, 2022 at 21:24
  • Why are you doing it this way? What's wrong with just doing .getBytes(StandardCharsets.UTF_8)? Commented Jul 8, 2022 at 21:32
  • Typo in import java.nio.charset;, no class name Commented Jul 8, 2022 at 21:42
  • In addition to @DuncG comment, there is no reason to have import java.lang.*;. In Java, that import is implicitly present in all source files. Commented Jul 8, 2022 at 21:44
  • @Sweeper CharSequence doesn't have a .getBytes() method Commented Jul 9, 2022 at 0:19

1 Answer

First, note that if you have a String, then you can simply do:

byte[] bytes = theString.getBytes(StandardCharsets.UTF_8);

Or, even if you have a CharSequence, you can do:

byte[] bytes = theCharSequence.toString().getBytes(StandardCharsets.UTF_8);

That may create a temporary String copy of the CharSequence if it's not already a String, though the copy should be quickly garbage collected.
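As a minimal sketch of that approach, the conversion can be wrapped in a small helper that accepts any CharSequence. The class and method names (Utf8Bytes, toUtf8) are made up for illustration, and the StringBuilder input is just one example of a CharSequence that is not a String.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Bytes {

    // Hypothetical helper: converts any CharSequence to UTF-8 bytes by way of a String.
    static byte[] toUtf8(CharSequence input) {
        // For a String, toString() returns the same instance; other
        // CharSequence implementations typically build a new String here.
        return input.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        CharSequence input = new StringBuilder("AAAAAAAAAA"); // ten A's
        byte[] utf8 = toUtf8(input);
        System.out.println(utf8.length);
        System.out.println(Arrays.toString(utf8));
    }
}
```

Because getBytes returns an array sized exactly to the encoded data, there is no position or limit to keep track of with this route.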

But regarding your question, you're not taking the ByteBuffer's limit (or position, though it's 0 in this case) into account. For whatever reason, encoding "AAAAAAAAAA" results in a buffer whose capacity is 11, but whose limit is 10. But the array() method returns the entire backing array, regardless of the buffer's position or limit. This means you need to manually take the limit (and position) into account when converting the ByteBuffer to a byte[].
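To make the mismatch visible, here is a small diagnostic sketch that prints the buffer's bounds next to the raw backing array. The class name BufferBounds and its encode helper are hypothetical. The sketch also assumes the returned buffer is backed by an accessible array, which seems to be the case for Charset.encode in practice but is not something I'd rely on blindly. My guess, and it is only a guess about OpenJDK internals, is that the spare slot comes from the encoder sizing its output from an average bytes-per-char estimate.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class BufferBounds {

    // Hypothetical helper: returns the ByteBuffer exactly as the charset hands it back.
    static ByteBuffer encode(CharSequence input) {
        return StandardCharsets.UTF_8.encode(CharBuffer.wrap(input));
    }

    public static void main(String[] args) {
        ByteBuffer buffer = encode("AAAAAAAAAA"); // ten A's

        // The valid data lives between position and limit.
        System.out.println("position  = " + buffer.position());
        System.out.println("limit     = " + buffer.limit());
        System.out.println("remaining = " + buffer.remaining());

        // The capacity (and so the backing array) may be larger than the limit.
        System.out.println("capacity  = " + buffer.capacity());

        // Assumption: the buffer has an accessible backing array.
        if (buffer.hasArray()) {
            System.out.println("array()   = " + Arrays.toString(buffer.array()));
        }
    }
}
```

Any extra trailing element printed for array() is unused buffer space, not part of the encoded text.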

For example:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Main {

  public static void main(String[] args) throws Exception {
    for (int i = 1; i <= 10; i++) {
      String string = "A".repeat(i);

      CharBuffer chars = CharBuffer.wrap(string);
      ByteBuffer bytes = StandardCharsets.UTF_8.encode(chars);

      System.out.printf("%-10s | %s%n", string, Arrays.toString(toByteArray(bytes)));
    }
  }

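  // Copies only the bytes between the buffer's position and limit.
  public static byte[] toByteArray(ByteBuffer buffer) {
    byte[] array = new byte[buffer.remaining()];
    // Absolute bulk get (Java 13+): reads from position() without moving it.
    buffer.get(buffer.position(), array);
    return array;
  }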
}

Which will output:

A          | [65]
AA         | [65, 65]
AAA        | [65, 65, 65]
AAAA       | [65, 65, 65, 65]
AAAAA      | [65, 65, 65, 65, 65]
AAAAAA     | [65, 65, 65, 65, 65, 65]
AAAAAAA    | [65, 65, 65, 65, 65, 65, 65]
AAAAAAAA   | [65, 65, 65, 65, 65, 65, 65, 65]
AAAAAAAAA  | [65, 65, 65, 65, 65, 65, 65, 65, 65]
AAAAAAAAAA | [65, 65, 65, 65, 65, 65, 65, 65, 65, 65]

Note that the above example copies a region of the buffer's backing array, though the original ByteBuffer should be quickly garbage collected. The only way I can think of to avoid that copy is to adapt your code to work with the ByteBuffer directly (if you return only the backing array, you lose the position/limit information). Or I suppose you could create a wrapper class.
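As a rough sketch of working with the ByteBuffer directly, the consumer can be handed the backing array together with an explicit offset and length, so no trimmed byte[] has to be created first. The names Utf8Writer and writeUtf8 are hypothetical, and an OutputStream is only one possible consumer; the fallback branch exists because a backing array is not guaranteed to be accessible.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

final class Utf8Writer {

    // Hypothetical helper: encodes the input and writes only the valid bytes to the stream.
    static void writeUtf8(CharSequence input, OutputStream out) throws IOException {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(CharBuffer.wrap(input));
        if (buffer.hasArray()) {
            // Pass the backing array with an explicit offset and length, so the
            // bytes beyond the limit are never written and no trimmed copy is made.
            int offset = buffer.arrayOffset() + buffer.position();
            out.write(buffer.array(), offset, buffer.remaining());
        } else {
            // Fallback for buffers without an accessible backing array (this path copies).
            byte[] copy = new byte[buffer.remaining()];
            buffer.get(copy);
            out.write(copy);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        writeUtf8("AAAAAAAAAA", sink); // ten A's
        System.out.println(sink.size());
    }
}
```

The same offset-and-length pattern should carry over to other APIs that accept a byte[] range, which is usually simpler than introducing a wrapper class.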
