9

I created the following for truncating a string in Java to a new string with a given number of bytes.

        String truncatedValue = "";
        String currentValue = string;
        int pivotIndex = (int) Math.round(((double) string.length())/2);
        while(!truncatedValue.equals(currentValue)){
            currentValue = string.substring(0,pivotIndex);
            byte[] bytes = null;
            bytes = currentValue.getBytes(encoding);
            if(bytes==null){
                return string;
            }
            int byteLength = bytes.length;
            int newIndex =  (int) Math.round(((double) pivotIndex)/2);
            if(byteLength > maxBytesLength){
                pivotIndex = newIndex;
            } else if(byteLength < maxBytesLength){
                pivotIndex = pivotIndex + 1;
            } else {
                truncatedValue = currentValue;
            }
        }
        return truncatedValue;

This is the first thing that came to my mind, and I know I could improve on it. I saw another post asking a similar question, but they were truncating strings using the bytes instead of String.substring. I think I would rather use String.substring in my case.

EDIT: I just removed the UTF8 reference because I would rather be able to do this for different storage types as well.

3
  • I would rephrase your problem. You are trying to fit a string into a byte array that cannot be larger than maxUTF8BytesLength. You want to use UTF-8 for the encoding. You want to copy as many characters as possible. Correct? Commented Aug 26, 2010 at 15:51
  • right, I would say that is correct. I also would like to do it efficiently. Commented Aug 26, 2010 at 16:04
  • I just edited the question to not reference UTF-8. Sorry about that, it was misleading. Commented Aug 26, 2010 at 16:09

12 Answers

14

Why not convert to bytes and walk forward, obeying UTF-8 character boundaries as you do it, until you've got the max number of bytes, then convert those bytes back into a string?
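A sketch of that walk-forward idea, assuming UTF-8 and valid input (the class and method names here are illustrative, not from the answer):

```java
import java.nio.charset.StandardCharsets;

public class WalkForward {
    // Step over whole UTF-8 sequences until the next one would exceed
    // maxBytes, then decode only the prefix that fits.
    public static String truncate(String s, int maxBytes) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length <= maxBytes) return s;
        int i = 0;
        while (i < maxBytes) {
            int b = utf8[i] & 0xFF;
            // Sequence length is determined by the lead byte.
            int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            if (i + len > maxBytes) break; // next character would not fit
            i += len;
        }
        return new String(utf8, 0, i, StandardCharsets.UTF_8);
    }
}
```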

Or you could just cut the original string if you keep track of where the cut should occur:

// Assuming that Java will always produce valid UTF8 from a string, so no error checking!
// (Is this always true, I wonder?)
public class UTF8Cutter {
  public static String cut(String s, int n) {
    byte[] utf8 = s.getBytes(java.nio.charset.StandardCharsets.UTF_8); // explicit charset: the platform default may not be UTF-8
    if (utf8.length < n) n = utf8.length;
    int n16 = 0;
    int advance = 1;
    int i = 0;
    while (i < n) {
      advance = 1;
      if ((utf8[i] & 0x80) == 0) i += 1;
      else if ((utf8[i] & 0xE0) == 0xC0) i += 2;
      else if ((utf8[i] & 0xF0) == 0xE0) i += 3;
      else { i += 4; advance = 2; }
      if (i <= n) n16 += advance;
    }
    return s.substring(0,n16);
  }
}

Note: edited to fix bugs on 2014-08-25


3 Comments

I definitely could do that. Is there any reason why using String.substring is any worse? It seems like doing it the way you describe would have to account for all the code points, which isn't a whole lot of fun. (depending on your definition of fun :) ).
@stevebot - To be efficient, you need to take advantage of the known structure of the data. If you don't care about efficiency and want it to be easy, or you want to support every possible Java encoding without having to know what it is, your method seems reasonable enough.
Wouldn’t it be even more efficient to iterate over the String’s characters and predict their encoded length, instead of encoding the entire string and then iterating over the encoded bytes to reconstitute their character association? Similar to this, just with non-BMP character support and counting before doing the substring, as in your answer…
8

The saner solution is to use a decoder (here limit is the maximum number of bytes to keep):

final Charset CHARSET = Charset.forName("UTF-8"); // or any other charset
final byte[] bytes = inputString.getBytes(CHARSET);
final CharsetDecoder decoder = CHARSET.newDecoder();
decoder.onMalformedInput(CodingErrorAction.IGNORE);
decoder.reset();
final CharBuffer decoded = decoder.decode(ByteBuffer.wrap(bytes, 0, limit));
final String outputString = decoded.toString();

4 Comments

Cutting at an arbitrary byte index may create invalid encoded data, as a single character may use multiple bytes (especially with UTF-8). Worse, with other encodings it might produce wrong but valid characters, which are not ignored. You could easily avoid this by first allocating a ByteBuffer with the desired size, then using it with a CharsetEncoder, which will automatically encode only as many valid characters as fit into the buffer, then decoding the buffer to a String. Similar approach, but without the bug, and even more efficient, as it won’t encode characters beyond the intended limit.
See this answer. It does even eliminate the decoding step.
@Holger My solution ignores truncated multibyte chars by CodingErrorAction.IGNORE. So it works fine. I am interested to see an example when it fails. However I agree, your solution looks neater and could be more performant.
Yes, for UTF-8 using CodingErrorAction.IGNORE will do the right thing. But the OP said “I would rather be able to do this for different storage types as well” and for other encodings, tearing multibyte sequences apart may result in valid (but wrong) characters.
5

I think Rex Kerr's solution has 2 bugs.

  • First, it will truncate to limit+1 if a non-ASCII character is just before the limit. Truncating "123456789á1" will result in "123456789á", which is 11 bytes in UTF-8.
  • Second, I think he misinterpreted the UTF standard. https://en.wikipedia.org/wiki/UTF-8#Description shows that a 110xxxxx at the beginning of a UTF-8 sequence tells us that the representation is 2 bytes long (as opposed to 3). That's the reason his implementation usually doesn't use up all available space (as Nissim Avitan noted).

Please find my corrected version below:

public String cut(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return s;
    }
    int n16 = 0;
    boolean extraLong = false;
    int i = 0;
    while (i < charLimit) {
        // Unicode characters above U+FFFF need 2 words in utf16
        extraLong = ((utf8[i] & 0xF0) == 0xF0);
        if ((utf8[i] & 0x80) == 0) {
            i += 1;
        } else {
            int b = utf8[i];
            while ((b & 0x80) > 0) {
                ++i;
                b = b << 1;
            }
        }
        if (i <= charLimit) {
            n16 += (extraLong) ? 2 : 1;
        }
    }
    return s.substring(0, n16);
}

I still thought this was far from effective. So if you don't really need the String representation of the result and the byte array will do, you can use this:

private byte[] cutToBytes(String s, int charLimit) throws UnsupportedEncodingException {
    byte[] utf8 = s.getBytes("UTF-8");
    if (utf8.length <= charLimit) {
        return utf8;
    }
    if ((utf8[charLimit] & 0x80) == 0) {
        // the limit doesn't cut an UTF-8 sequence
        return Arrays.copyOf(utf8, charLimit);
    }
    int i = 0;
    while ((utf8[charLimit-i-1] & 0x80) > 0 && (utf8[charLimit-i-1] & 0x40) == 0) {
        ++i;
    }
    if ((utf8[charLimit-i-1] & 0x80) > 0) {
        // we have to skip the starter UTF-8 byte
        return Arrays.copyOf(utf8, charLimit-i-1);
    } else {
        // we passed all UTF-8 bytes
        return Arrays.copyOf(utf8, charLimit-i);
    }
}

Funny thing is that with a realistic 20-500 byte limit they perform pretty much the same IF you create a string from the byte array again.

Please note that both methods assume valid UTF-8 input, which is a safe assumption after using Java's getBytes() function.

5 Comments

You should also catch UnsupportedEncodingException at s.getBytes("UTF-8")
I don't see getBytes throwing anything. Although docs.oracle.com/javase/7/docs/api/java/lang/… says "The behavior of this method when this string cannot be encoded in the given charset is unspecified."
The page you linked shows that it throws UnsupportedEncodingException: "public byte[] getBytes(String charsetName) throws UnsupportedEncodingException"
Thanks! Strange, I don't know what version I used when I posted this solution 2 years ago. Updating the code above.
Instead of providing the encoding name as a String you can use the Charset constants from StandardCharsets class because the String#getBytes(Charset charset) method does not throw UnsupportedEncodingException.
5
String s = "FOOBAR";

int limit = 3;
s = new String(s.getBytes(), 0, limit);

Result value of s:

FOO

2 Comments

When the MAX_LENGTH interrupts the byte array in the middle of a multi-byte sequence, then the resulting string ends with a "?". Example: s = "ää"; MAX_LENGTH = 3; result: "ä?" Given the simplicity of this code, however maybe in some situations this might be an option.
To correct my comment: MAX_LENGTH = 5 (why does the solution use MAX_LENGTH - 2?). Also note that as of Java 7, "UTF-8" can be replaced by StandardCharsets.UTF_8.
3

Use the UTF-8 CharsetEncoder, and encode until the output ByteBuffer contains as many bytes as you are willing to take, by looking for CoderResult.OVERFLOW.
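A sketch of that approach (the class and method names are my own, not from the answer):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class EncoderTruncate {
    // Encode into a buffer of exactly maxBytes; on overflow the encoder stops
    // at the last character that fits, never splitting a multi-byte sequence
    // or a surrogate pair.
    public static String truncate(String s, int maxBytes) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        CharBuffer in = CharBuffer.wrap(s);
        CoderResult result = encoder.encode(in, out, true);
        if (result.isOverflow()) {
            // in.position() counts the chars that were fully encoded
            return s.substring(0, in.position());
        }
        return s; // the whole string fit
    }
}
```

This also sidesteps the decoding step: a prefix of the original string is returned directly.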

Comments

2

As noted, Peter Lawrey's solution has a major performance disadvantage (~3,500 ms for 10,000 runs). Rex Kerr's was much better (~500 ms for 10,000 runs), but the result was not accurate: it cut much more than needed (instead of 4,000 remaining bytes it left 3,500 in one example). Attached here is my solution (~250 ms for 10,000 runs), assuming that the maximum length of a UTF-8 character is 4 bytes (thanks, Wikipedia):

public static String cutWord (String word, int dbLimit) throws UnsupportedEncodingException{
    double MAX_UTF8_CHAR_LENGTH = 4.0;
    if(word.length()>dbLimit){
        word = word.substring(0, dbLimit);
    }
    if(word.length() > dbLimit/MAX_UTF8_CHAR_LENGTH){
        int residual=word.getBytes("UTF-8").length-dbLimit;
        if(residual>0){
            int tempResidual = residual,start, end = word.length();
            while(tempResidual > 0){
                start = end-((int) Math.ceil((double)tempResidual/MAX_UTF8_CHAR_LENGTH));
                tempResidual = tempResidual - word.substring(start,end).getBytes("UTF-8").length;
                end=start;
            }
            word = word.substring(0, end);
        }
    }
    return word;
}

1 Comment

Doesn't look like this solution prevents a trailing half surrogate pair? Second, in case getBytes().length would happen to be applied to both halves of a surrogate pair individually (not immediately obvious to me it never will), it'd also underestimate the size of the UTF-8 representation of the pair as a whole, assuming the "replacement byte array" is a single byte. Third, the 4-byte UTF-8 code points all require a two-char surrogate pair in Java, so effectively the max is just 3 bytes per Java character.
1

You could convert the string to bytes and convert just those bytes back to a string.

public static String substring(String text, int maxBytes) {
   StringBuilder ret = new StringBuilder();
   for(int i = 0;i < text.length(); i++) {
       // works out how many bytes a character takes, 
       // and removes these from the total allowed.
       if((maxBytes -= text.substring(i, i+1).getBytes().length) < 0) break;
       ret.append(text.charAt(i));
   }
   return ret.toString();
}

3 Comments

@nguyendat, there is lots of reasons this is not very performant. The main one would be the object creation for the substring() and getBytes() However, you would be surprised how much you can do in a milli-second and that is usually enough.
That method doesn't handle surrogate pairs properly, e.g. substring("\uD800\uDF30\uD800\uDF30", 4).getBytes("UTF-8").length will return 8, not 4. Half a surrogate pair is represented as a single-byte "?" by String.getBytes("UTF-8").
@StefanL I posted a variant of this answer here which should handle surrogate pairs properly.
0

Using the regular expression below, you can also remove leading and trailing whitespace around double-byte characters.

stringtoConvert = stringtoConvert.replaceAll("^[\\s ]*", "").replaceAll("[\\s ]*$", "");
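Note that Java's \s class does not match the full-width (ideographic) space U+3000; if double-byte spaces are the target, a sketch might add it to the character class explicitly (TrimWide is a hypothetical name, not from the answer):

```java
public class TrimWide {
    // Strip leading and trailing whitespace; \s alone does not cover the
    // full-width space U+3000, so it is added to the class explicitly.
    public static String trim(String s) {
        return s.replaceAll("^[\\s\u3000]+", "").replaceAll("[\\s\u3000]+$", "");
    }
}
```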

Comments

0

This is my solution:

private static final int FIELD_MAX = 2000;
private static final Charset CHARSET =  Charset.forName("UTF-8"); 

public String trancStatus(String status) {

    if (status != null && (status.getBytes(CHARSET).length > FIELD_MAX)) {
        int maxLength = FIELD_MAX;

        int left = 0, right = status.length();
        int index = 0, bytes = 0, sizeNextChar = 0;

        while (bytes != maxLength && (bytes > maxLength || (bytes + sizeNextChar < maxLength))) {

            index = left + (right - left) / 2;

            bytes = status.substring(0, index).getBytes(CHARSET).length;
            sizeNextChar = String.valueOf(status.charAt(index + 1)).getBytes(CHARSET).length;

            if (bytes < maxLength) {
                left = index - 1;
            } else {
                right = index + 1;
            }
        }

        return status.substring(0, index);

    } else {
        return status;
    }
}

Comments

0

This may not be the most efficient solution, but it works:

public static String substring(String s, int byteLimit) {
    if (s.getBytes().length <= byteLimit) {
        return s;
    }

    int n = Math.min(byteLimit-1, s.length()-1);
    do {
        s = s.substring(0, n--);
    } while (s.getBytes().length > byteLimit);

    return s;
}

Comments

0

I've improved upon Peter Lawrey's solution to accurately handle surrogate pairs. In addition, I optimized based on the fact that the maximum number of bytes per char in UTF-8 encoding is 3.

public static String substring(String text, int maxBytes) {
    for (int i = 0, len = text.length(); (len - i) * 3 > maxBytes;) {
        int j = text.offsetByCodePoints(i, 1);
        if ((maxBytes -= text.substring(i, j).getBytes(StandardCharsets.UTF_8).length) < 0)  
            return text.substring(0, i);
        i = j;
    }
    return text;
}
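As a check of the surrogate-pair handling, the method can be exercised like this (the body is copied from the answer above so the snippet stands alone; SubstringDemo is a hypothetical wrapper class):

```java
import java.nio.charset.StandardCharsets;

public class SubstringDemo {
    // Same logic as the answer: advance one code point at a time so a
    // surrogate pair is kept or dropped as a unit, never split.
    public static String substring(String text, int maxBytes) {
        for (int i = 0, len = text.length(); (len - i) * 3 > maxBytes;) {
            int j = text.offsetByCodePoints(i, 1);
            if ((maxBytes -= text.substring(i, j).getBytes(StandardCharsets.UTF_8).length) < 0)
                return text.substring(0, i);
            i = j;
        }
        return text;
    }
}
```

With a 4-byte budget, a string of two supplementary characters (each a surrogate pair, 4 UTF-8 bytes apiece) is cut to exactly one whole pair.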

Comments

0

Binary search approach in Scala:

import scala.annotation.tailrec

private def bytes(s: String) = s.getBytes("UTF-8")

def truncateToByteLength(string: String, length: Int): String =
  if (length <= 0 || string.isEmpty) ""
  else {
    @tailrec
    def loop(badLen: Int, goodLen: Int, good: String): String = {
      assert(badLen > goodLen, s"""badLen is $badLen but goodLen is $goodLen ("$good")""")
      if (badLen == goodLen + 1) good
      else {
        val mid = goodLen + (badLen - goodLen) / 2
        val midStr = string.take(mid)
        if (bytes(midStr).length > length)
          loop(mid, goodLen, good)
        else
          loop(badLen, mid, midStr)
      }
    }

    loop(string.length * 2, 0, "")
  }

Comments
