I have this problem: a method receives a String that must fit in a database column limited to 200 (VARCHAR). With certain characters, even though the length of the String is less than 200, the length in bytes is apparently more than 200, so I tried the following.

First, get the byte length of the String:

byte[] nameBytes = name.getBytes("UTF-8");

Then, if nameBytes.length > 200, I create a new String from a subarray of the original nameBytes, like this:

name = new String(Arrays.copyOfRange(nameBytes, 0, 200), "UTF-8");

I am sure that Arrays.copyOfRange(nameBytes, 0, 200) returns an array of length 200, but for some reason, after I create the new String, checking name.getBytes("UTF-8").length gives me 201, and I don't know why one more byte is being added.

Is there something I am doing wrong? Or is there a way to be sure of creating a byte array of the same length as the char array?

Thanks in advance.

  • Bytes are not characters. UTF-8 encodes each character in 1 to 4 bytes. Commented Nov 27, 2015 at 18:03
  • Does your database limit the number of bytes, or the number of characters? Which DBMS is it anyway? Commented Nov 27, 2015 at 18:03
  • @SamM Is there a way to know the number of characters? I guess a String stores characters, right? Commented Nov 27, 2015 at 18:15
  • @Thomas It is DB2. I guess it limits by bytes, but I am not sure: string.length() gives me the number of characters, I guess, and in this case it is less than 150, but getBytes returns more than 201 bytes and the database reports the error. Commented Nov 27, 2015 at 18:18

1 Answer

First, some examples:



        // requires: import java.nio.charset.Charset;
        String name = "façade";

        // Number of chars (UTF-16 code units) held by the String itself
        System.out.println(String.format("String '%s': %d", name, name.length()));

        // For each charset: encoded byte count / char count after decoding back
        for (String cs : new String[] {"UTF-8", "UTF-16", "UTF-16BE"}) {
            Charset charset = Charset.forName(cs);
            byte[] nameBytes = name.getBytes(charset);
            System.out.println(String.format("%s: %d / %d", cs, nameBytes.length, new String(nameBytes, charset).length()));
        }

with the output:



    String 'façade': 6  ---> 6 characters with one outside ASCII range
    UTF-8: 7 / 6 ---> 'ç' requires 2 bytes, the others only one
    UTF-16: 14 / 6 ---> 2 x 6 bytes for code points + 2 bytes for BOM
    UTF-16BE: 12 / 6 ---> no need to embed the BOM here => 2 x 6 bytes are enough

Comments:

  • always specify a charset, in both directions: when encoding (getBytes) and when decoding (new String)
  • about the BOM, see Byte order mark
  • quoting Unicode Character Representations: The char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities.
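
To illustrate the last point, here is a small sketch (the class and method names are made up for this example) showing that length(), the number of code points, and the UTF-8 byte count can all differ for the same String:

```java
import java.nio.charset.StandardCharsets;

class LengthKinds {

    // Hypothetical helper: number of bytes the String takes once encoded as UTF-8
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Hypothetical helper: number of Unicode code points in the String
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a', 'ç' (U+00E7) and U+1F600, which needs a surrogate pair in UTF-16
        String s = "a\u00e7\uD83D\uDE00";

        System.out.println(s.length());    // 4 chars (16-bit code units)
        System.out.println(codePoints(s)); // 3 code points
        System.out.println(utf8Length(s)); // 7 bytes: 1 + 2 + 4
    }
}
```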

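The issue here is the charset used by your database. If it's UTF-8, then you have to check character by character until you reach the 200-byte limit. With UTF-8, you can't cut the encoded string at an arbitrary byte offset: the cut can land in the middle of a multi-byte character. When the truncated array is decoded, the incomplete sequence is in practice replaced by the replacement character U+FFFD, which itself takes 3 bytes in UTF-8, so the re-encoded String can come out longer than the array you cut (for example 201 bytes instead of 200).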
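
A minimal sketch of that character-by-character check, assuming the column is UTF-8 and the limit is counted in bytes. The class and method names are hypothetical, and a limit of 3 is used instead of 200 only to keep the example short:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Truncate {

    // Hypothetical helper: longest prefix of s whose UTF-8 encoding fits in maxBytes
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            // UTF-8 size of one code point: 1 to 4 bytes
            int size = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
            if (bytes + size > maxBytes) {
                break; // the next character would not fit entirely
            }
            bytes += size;
            i += Character.charCount(cp); // 2 chars for a surrogate pair, else 1
        }
        return s.substring(0, i);
    }

    // The approach from the question, for comparison: cut the byte array, then decode
    static String naiveTruncate(String s, int maxBytes) {
        byte[] all = s.getBytes(StandardCharsets.UTF_8);
        if (all.length <= maxBytes) {
            return s;
        }
        return new String(Arrays.copyOfRange(all, 0, maxBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "fa\u00e7ade"; // "façade"; 'ç' takes 2 bytes in UTF-8
        int limit = 3;               // small limit so the cut lands inside 'ç'

        // The half of 'ç' that was kept decodes to U+FFFD, which re-encodes to 3 bytes
        String naive = naiveTruncate(name, limit);
        System.out.println(naive.getBytes(StandardCharsets.UTF_8).length); // 5

        String safe = truncateToUtf8Bytes(name, limit);
        System.out.println(safe + " " + safe.getBytes(StandardCharsets.UTF_8).length); // fa 2
    }
}
```

In the question's case the call would be truncateToUtf8Bytes(name, 200). The result may encode to slightly fewer than 200 bytes, because a character that does not fit entirely is dropped rather than split.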
