
I have a situation where I need to know the size of a String/encoding pair, in bytes, but cannot use the getBytes() method because 1) the String is very large and duplicating the String in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java version specific).

So - is there some way to calculate the byte size of a String/encoding pair directly from the String object?

UPDATE:

Here's a working implementation of jtahlborn's answer:

private class CountingOutputStream extends OutputStream {
    long total; // long, not int: a 1.5B-char String can encode to more than 2^31 bytes

    @Override
    public void write(int i) {
        throw new RuntimeException("don't use");
    }

    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        total += len;
    }
}
  • The length in bytes depends on your target encoding. For example, "test".getBytes("UTF-8") is 4 bytes, but "test".getBytes("UTF-16") is 10 bytes (yes, 10, try it). So you need to clarify your question a bit. Commented Nov 8, 2013 at 7:02
  • I would add that it is also dependent on the code points ("characters") you are encoding. For example, in UTF-16, some code points use 1 code unit and others use 2 (a code unit is 16 bits long). UTF-8 can take anywhere from 1 to 4 bytes per character. Commented Nov 8, 2013 at 7:17
  • @brettw Sorry if I'm being dense, but yes, your comment is the point of the question: given a String and an encoding, how many bytes does encoding the String require? Rereading the question, that seems pretty clear to me - do you have any suggestions for rewording it? Commented Nov 8, 2013 at 7:30
  • @Francis the comment above applies to your comment as well, to the best of my ability to tell. Commented Nov 8, 2013 at 7:31
  • getBytes does not create an array bigger than it needs to be. It creates an array of the correct size for the given string. It does not create an array of length "length of the String * the maximum possible bytes per character". And string.length() does not return the number of characters in a string; it returns the number of code units. For UTF-16, a code unit is 16 bits, and the number of code units per character is either 1 or 2, depending on the character. Therefore, either I don't understand the second point in your question, or your assumption is not correct. Commented Nov 8, 2013 at 7:51

5 Answers


Simple, just write it to a dummy output stream:

class CountingOutputStream extends OutputStream {
  private int _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public int getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

It's not only simple, but probably just as fast as the other "complex" answers.


15 Comments

@elhefe - your version may compile, but it is incorrect. you don't want to use the offset in the calculation.
Whoops, fixed. Apparently only the write(byte[]) method was used by my tests.
@AminSuzani - changing _total to a long would be sufficient.
I am not exactly sure what is being saved here. The String is still going to be duplicated into a char array by the OutputStreamWriter (via the StreamEncoder.write((String str, int off, int len) method) before it tries to do the byte conversion.
But it doesn't solve the OP's problem. You are just replacing the byte[] array allocation the OP was trying to get rid of with another (the char[] array), which will likely turn out around the same size. Of course, if I had a solution I would post it :).

Guava has an implementation according to this post:

Utf8.encodedLength()
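Note that Guava's Utf8.encodedLength(CharSequence) only covers UTF-8, and throws IllegalArgumentException on unpaired surrogates. If pulling in Guava is not an option, the same count is a short loop over code points; a minimal stdlib-only sketch (not Guava's actual implementation, and it assumes well-formed surrogate pairs):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Walks the String by code point and sums UTF-8 byte widths -- no
    // intermediate byte[] is ever allocated. Assumes paired surrogates.
    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (cp < 0x80)         bytes += 1; // ASCII
            else if (cp < 0x800)   bytes += 2;
            else if (cp < 0x10000) bytes += 3; // rest of the BMP
            else                   bytes += 4; // supplementary plane
            i += Character.charCount(cp);
        }
        return bytes;
    }

    public static void main(String[] args) {
        String s = "h\u00e9llo \uD83D\uDE00"; // 2-byte e-acute, 4-byte emoji
        System.out.println(utf8Length(s));                          // 11
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 11
    }
}
```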



The same using apache-commons libraries:

public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        IOUtils.write(string, count, charset.name());
        count.flush();
        return count.getCount();
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}



Here's an apparently working implementation:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TestUnicode {

    private final static int ENCODE_CHUNK = 100;

    public static long bytesRequiredToEncode(final String s,
            final Charset encoding) {
        long count = 0;
        for (int i = 0; i < s.length(); ) {
            int end = i + ENCODE_CHUNK;
            if (end >= s.length()) {
                end = s.length();
            } else if (Character.isHighSurrogate(s.charAt(end))) {
                end++;
            }
            count += encoding.encode(s.substring(i, end)).remaining();
            i = end;
        }
        return count;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.appendCodePoint(11614);
            sb.appendCodePoint(1061122);
            sb.appendCodePoint(2065);
            sb.appendCodePoint(1064124);
        }
        Charset cs = StandardCharsets.UTF_8;

        System.out.println(bytesRequiredToEncode(new String(sb), cs));
        System.out.println(new String(sb).getBytes(cs).length);
    }
}

The output is:

1400
1400

In practice I'd increase ENCODE_CHUNK to 10M chars or so.

Probably slightly less efficient than brettw's answer, but simpler to implement.

1 Comment

This isn’t so bad, considering that the OutputStreamWriter of the other solution will also perform an actual encoding operation into a buffer, before passing it to the CountingOutputStream. The only disadvantage is that your solution allocates new ByteBuffer instances. When you fix that by implementing the standard encoding loop, you’ve got the fastest possible (generic) solution. See this answer for a cheap calculation specifically for UTF-8.
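For reference, a sketch of the "standard encoding loop" that comment describes, reusing one fixed-size scratch ByteBuffer instead of allocating per chunk (the names here are illustrative, not from either answer):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class EncodedLength {
    // Encodes into one reused scratch buffer, counting and discarding the
    // bytes, so memory use is constant regardless of input length.
    static long encodedLength(String s, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        CharBuffer in = CharBuffer.wrap(s); // wraps the String, no char[] copy
        ByteBuffer scratch = ByteBuffer.allocate(8 * 1024);
        long total = 0;
        try {
            while (true) {
                CoderResult r = encoder.encode(in, scratch, true);
                total += scratch.position();
                scratch.clear(); // throw the bytes away, keep only the count
                // note: unlike getBytes(), malformed input throws here
                // instead of being replaced
                if (r.isError()) r.throwException();
                if (r.isUnderflow()) break; // all input consumed
            }
            while (true) { // drain anything the encoder buffered internally
                CoderResult r = encoder.flush(scratch);
                total += scratch.position();
                scratch.clear();
                if (r.isUnderflow()) break;
            }
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("unencodable input", e);
        }
        return total;
    }

    public static void main(String[] args) {
        String s = "gr\u00fc\u00dfe \uD83D\uDE00";
        System.out.println(encodedLength(s, StandardCharsets.UTF_8));
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

Passing true as endOfInput on every call is allowed because the input is complete from the start; the flush loop then picks up any bytes (such as a BOM-related tail) the encoder still holds.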

Ok, this is extremely gross. I admit that, but this stuff is hidden by the JVM, so we have to dig a little. And sweat a little.

First, we want the actual char[] that backs a String without making a copy. To do this we have to use reflection to get at the 'value' field:

char[] chars = null;
for (Field field : String.class.getDeclaredFields()) {
    if ("value".equals(field.getName())) {
        field.setAccessible(true);
        chars = (char[]) field.get(string); // <--- got it!
        break;
    }
}

Next you need to implement a subclass of java.nio.ByteBuffer. Something like:

class MyByteBuffer extends ByteBuffer {
    int length;            
    // Your implementation here
};

Ignore all of the getters, implement all of the put methods like put(byte) and putChar(char) etc. Inside something like put(byte), increment length by 1, inside of put(byte[]) increment length by the array length. Get it? Everything that is put, you add the size of whatever it is to length. But you're not storing anything in your ByteBuffer, you're just counting and throwing away, so no space is taken. If you breakpoint the put methods, you can probably figure out which ones you actually need to implement. putFloat(float) is probably not used, for example.

Now for the grand finale, putting it all together:

MyByteBuffer bbuf = new MyByteBuffer();         // your "counting" buffer
CharBuffer cbuf = CharBuffer.wrap(chars);       // wrap your char array
Charset charset = Charset.forName("UTF-8");     // your charset goes here
CharsetEncoder encoder = charset.newEncoder();  // make a new encoder
encoder.encode(cbuf, bbuf, true);               // do it!
System.out.printf("Length: %d\n", bbuf.length); // pay me US$1,000,000

3 Comments

You can avoid the ugly reflection stuff, by simply calling CharBuffer.wrap(CharSequence) with the String itself. It will use the char[] from the String without copying (at least in Oracle JDK 7 Update 21).
Oh nice! I did not know that.
As @JoachimSauer told long ago, there is no need for this Reflection hack, so why does this answer still start with it? Starting with Java 9, this will fail as the internal array is not a char[] (letting aside alternative JRE implementations where it failed even earlier). Besides that, it’s strange to loop over getDeclaredFields() instead of just calling getDeclaredField("value"), but anyway. The main idea of your answer, creating a subclass of ByteBuffer in the application, is impossible.
