
I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large and duplicating the String in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java version specific).

So - is there some way to calculate the byte size of a String/encoding pair directly from the String object?

UPDATE:

Here's a working implementation of jtahlborn's answer:

private class CountingOutputStream extends OutputStream {
    long total; // long, not int: a 1.5B-char String can encode to more than 2^31 bytes

    @Override
    public void write(int i) {
        throw new RuntimeException("don't use");
    }

    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        total += len;
    }
}
  • The length in bytes depends on your target encoding. For example, "test".getBytes("UTF-8") is 4 bytes, but "test".getBytes("UTF-16") is 10 bytes (yes, 10, try it). So you need to clarify your question a bit. Commented Nov 8, 2013 at 7:02
  • I would add that it is also dependent on the code points ("characters") you are encoding. For example, in UTF-16, some code points use 1 code unit and others use 2 (a code unit is 16 bits long). UTF-8 can take anywhere from 1 to 4 bytes per character. Commented Nov 8, 2013 at 7:17
  • @brettw Sorry if I'm being dense, but yes, your comment is the point of the question: given a String and an encoding, how many bytes does encoding the String require? Rereading the question, that seems pretty clear to me - do you have any suggestions for rewording it? Commented Nov 8, 2013 at 7:30
  • @Francis the comment above applies to your comment as well, to the best of my ability to tell. Commented Nov 8, 2013 at 7:31
  • getBytes does not create an array bigger than it needs to be. It creates an array of the correct size for the given string. It does not create an array of length "length of the String * the maximum possible bytes per character". And string.length() does not return the number of characters in a string; it returns the number of code units. For UTF-16, a code unit is 16 bits, and the number of code units per character is either 1 or 2, depending on the character. Therefore, either I don't understand the second point in your question, or your assumption is not correct. Commented Nov 8, 2013 at 7:51

5 Answers


Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
  private int _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public int getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

It's not only simple, but probably just as fast as the other "complex" answers.


15 Comments

@elhefe - your version may compile, but it is incorrect. you don't want to use the offset in the calculation.
Whoops, fixed. Apparently only the write(byte[]) method was used by my tests.
@AminSuzani - changing _total to a long would be sufficient.
I am not exactly sure what is being saved here. The String is still going to be duplicated into a char array by the OutputStreamWriter (via the StreamEncoder.write((String str, int off, int len) method) before it tries to do the byte conversion.
But it doesn't solve the OP's problem. You are just replacing the byte[] array allocation the OP was trying to get rid of with another (the char[] array), which will likely turn out around the same size. Of course, if I had a solution I would post it :).

Guava has an implementation according to this post:

Utf8.encodedLength()
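Note that Guava's Utf8.encodedLength(CharSequence) only covers UTF-8, and throws IllegalArgumentException on unpaired surrogates. If pulling in Guava is not an option, the same count is a short loop over code points; a minimal stdlib-only sketch (not Guava's actual implementation, and it assumes well-formed surrogate pairs):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Walks the String by code point and sums UTF-8 byte widths -- no
    // intermediate byte[] is ever allocated. Assumes paired surrogates.
    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (cp < 0x80)         bytes += 1; // ASCII
            else if (cp < 0x800)   bytes += 2;
            else if (cp < 0x10000) bytes += 3; // rest of the BMP
            else                   bytes += 4; // supplementary plane
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \uD83D\uDE00"; // 2-byte e-acute, 4-byte emoji
        System.out.println(utf8Length(s));                          // 11
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 11
    }
}
```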



The same using apache-commons libraries:

public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        return count.getCount();
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}



Here's an apparently working implementation:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
            } else if (Character.isHighSurrogate(s.charAt(end))) {
                end++;
            }
            count += encoding.encode(s.substring(i, end)).remaining();
            i = end;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

The output is:

1400
1400

In practice I'd increase ENCODE_CHUNK to 10M chars or so.

Probably slightly less efficient than brettw's answer, but simpler to implement.

1 Comment

This isn’t so bad, considering that the OutputStreamWriter of the other solution will also perform an actual encoding operation into a buffer, before passing it to the CountingOutputStream. The only disadvantage is that your solution allocates new ByteBuffer instances. When you fix that by implementing the standard encoding loop, you’ve got the fastest possible (generic) solution. See this answer for a cheap calculation specifically for UTF-8.
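For reference, a sketch of the "standard encoding loop" that comment describes, reusing one fixed-size scratch ByteBuffer instead of allocating per chunk (the names here are illustrative, not from either answer):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // Encodes into one reused scratch buffer, counting and discarding the
    // bytes, so memory use is constant regardless of input length.
    static long encodedLength(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        CharBuffer in = CharBuffer.wrap(s); // wraps the String, no char[] copy
        ByteBuffer scratch = ByteBuffer.allocate(8 * 1024);
        long total = 0;
        try {
            while (true) {
                CoderResult r = encoder.encode(in, scratch, true);
                total += scratch.position();
                scratch.clear(); // throw the bytes away, keep only the count
                // note: unlike getBytes(), malformed input throws here
                // instead of being replaced
                if (r.isError()) r.throwException();
                if (r.isUnderflow()) break; // all input consumed
            }
            while (true) { // drain anything the encoder buffered internally
                CoderResult r = encoder.flush(scratch);
                total += scratch.position();
                scratch.clear();
                if (r.isUnderflow()) break;
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("unencodable input", e);
        }
        return total;
    }

    public static void main(String[] args) {
        String s = "gr\u00fc\u00dfe \uD83D\uDE00";
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Passing true as endOfInput on every call is allowed because the input is complete from the start; the flush loop then picks up any bytes (such as a BOM-related tail) the encoder still holds.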

Ok, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we want the actual char[] that backs a String without making a copy. To do this we have to use reflection to get at the 'value' field:

char[] chars = null;
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}

Next you need to implement a subclass of java.nio.ByteBuffer. Something like:

class MyByteBuffer extends ByteBuffer {
    int length;            
    // Your implementation here
};

Ignore all of the getters, implement all of the put methods like put(byte) and putChar(char) etc. Inside something like put(byte), increment length by 1, inside of put(byte[]) increment length by the array length. Get it? Everything that is put, you add the size of whatever it is to length. But you're not storing anything in your ByteBuffer, you're just counting and throwing away, so no space is taken. If you breakpoint the put methods, you can probably figure out which ones you actually need to implement. putFloat(float) is probably not used, for example.

Now for the grand finale, putting it all together:

MyByteBuffer bbuf = new MyByteBuffer();         // your "counting" buffer
CharBuffer cbuf = CharBuffer.wrap(chars);       // wrap your char array
Charset charset = Charset.forName("UTF-8");     // your charset goes here
CharsetEncoder encoder = charset.newEncoder();  // make a new encoder
encoder.encode(cbuf, bbuf, true);               // do it!
System.out.printf("Length: %d\n", bbuf.length); // pay me US$1,000,000

3 Comments

You can avoid the ugly reflection stuff, by simply calling CharBuffer.wrap(CharSequence) with the String itself. It will use the char[] from the String without copying (at least in Oracle JDK 7 Update 21).
Oh nice! I did not know that.
As @JoachimSauer told long ago, there is no need for this Reflection hack, so why does this answer still start with it? Starting with Java 9, this will fail as the internal array is not a char[] (letting aside alternative JRE implementations where it failed even earlier). Besides that, it’s strange to loop over getDeclaredFields() instead of just calling getDeclaredField("value"), but anyway. The main idea of your answer, creating a subclass of ByteBuffer in the application, is impossible.
