create a string from byte array does not return same length

Question

I receive a String in a method, and in the database the column is limited to 200 (VARCHAR). With certain characters, although the length of the String is less than 200, the length in bytes is apparently more than 200, so I tried this:

Get the byte length of the String:

byte[] nameBytes = name.getBytes("UTF-8");

Then, if nameBytes.length > 200, I try to create a new String from a subarray of the original nameBytes, like this:

name = new String(Arrays.copyOfRange(nameBytes, 0, 200), "UTF-8");

I am sure that Arrays.copyOfRange(nameBytes, 0, 200) returns an array of length 200, but for some reason, after I create the new String, checking name.getBytes("UTF-8").length gives me 201, so I don't know why one more byte is being added.

Is there something I am doing wrong? Or is there a way to be sure of creating an array of the same length as the char array?

Thanks in advance.

Bytes are not characters. UTF-8 encodes each character in 1 to 4 bytes.
– Sam McCreery, Commented Nov 27, 2015 at 18:03
Does your database limit the number of bytes, or the number of characters? Which DBMS is it anyway?
– Thomas, Commented Nov 27, 2015 at 18:03
@SamM Is there a way to know the number of characters? I guess a String stores characters, right?
– John B, Commented Nov 27, 2015 at 18:15
@Thomas It is DB2. I guess it limits by bytes, but I am not sure: for example, string.length() gives me the number of characters, I guess, and in this case that is less than 150, but getBytes returns more than 201 bytes and the database reports the error.
– John B, Commented Nov 27, 2015 at 18:18

atao · Accepted Answer · 2015-11-27 20:36:17Z

First, some examples:



        // Requires: import java.nio.charset.Charset;
        // Passing a Charset (not a charset name) to new String(...) avoids
        // the checked UnsupportedEncodingException.
        Charset cs;
        String name = "façade";
        byte[] nameBytes;

        // String length in chars, then per charset: encoded byte count / decoded char count
        System.out.println(String.format("String '%s': %d", name, name.length()));
        cs = Charset.forName("UTF-8");
        nameBytes = name.getBytes(cs);
        System.out.println(String.format("%s: %d / %d", cs, nameBytes.length, new String(nameBytes, cs).length()));
        cs = Charset.forName("UTF-16");
        nameBytes = name.getBytes(cs);
        System.out.println(String.format("%s: %d / %d", cs, nameBytes.length, new String(nameBytes, cs).length()));
        cs = Charset.forName("UTF-16BE");
        nameBytes = name.getBytes(cs);
        System.out.println(String.format("%s: %d / %d", cs, nameBytes.length, new String(nameBytes, cs).length()));

with the output:



    String 'façade': 6  ---> 6 characters with one outside ASCII range
    UTF-8: 7 / 6 ---> 'ç' requires 2 bytes, the others only one
    UTF-16: 14 / 6 ---> 2 x 6 bytes for code points + 2 bytes for BOM
    UTF-16BE: 12 / 6 ---> no BOM is embedded here => 2 x 6 bytes are enough

Comments:

Always specify a charset, in both directions (encoding and decoding).
About the BOM, see Byte order mark.
As stated in Unicode Character Representations: The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities.
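
To make the last point concrete, here is a small sketch showing that String.length() counts 16-bit char units, which is neither the number of Unicode code points nor the number of bytes. The class name `CharVsCodePoint` and the helper `utf8Length` are made up for illustration. A character outside the Basic Multilingual Plane takes two chars and four UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

public class CharVsCodePoint {

    // Hypothetical helper: number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, written as its surrogate pair so the source stays ASCII.
        String s = "a\uD83D\uDE00";
        System.out.println(s.length());                       // 3 char units
        System.out.println(s.codePointCount(0, s.length()));  // 2 code points
        System.out.println(utf8Length(s));                    // 5 bytes (1 + 4)
    }
}
```

So three different "lengths" can apply to the same String, and a byte-limited column only cares about the last one.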

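The issue here is the charset used by your database. If it is UTF-8 and the 200 limit is counted in bytes, you have to walk the string character by character and stop before the 200-byte limit is exceeded. With UTF-8, you can't cut the string at an arbitrary byte offset: the cut may land in the middle of a multi-byte character. When that happens, new String(bytes, "UTF-8") replaces the incomplete sequence with the replacement character U+FFFD, which itself takes 3 bytes in UTF-8, so the re-encoded string can end up longer than the 200 bytes you kept.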
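
A minimal sketch of that character-by-character check, assuming the column limit is counted in UTF-8 bytes (worth confirming for your DB2 setup). The class name `Utf8Truncate` and the method `truncateToUtf8Bytes` are hypothetical names chosen for illustration. It walks the string by code point, so it never splits a multi-byte sequence or a surrogate pair.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Truncate {

    // Hypothetical helper: returns the longest prefix of s whose UTF-8
    // encoding fits in maxBytes, cutting only between code points.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 size of one code point: 1, 2, 3 or 4 bytes.
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break; // the next character would not fit
            }
            bytes += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair, else 1
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String name = "fa\u00e7ade"; // the example word, c-cedilla escaped to keep the source ASCII
        String cut = truncateToUtf8Bytes(name, 3);
        // The 2-byte c-cedilla would bring the total to 4 bytes, so the cut stops before it.
        System.out.println(cut + " -> " + cut.getBytes(StandardCharsets.UTF_8).length + " bytes");
        // prints "fa -> 2 bytes"
    }
}
```

In the question's case the call would be something like `name = truncateToUtf8Bytes(name, 200);`. The result may encode to fewer than 200 bytes, but never more.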
