What are some ways to avoid String.substring from returning substring with invalid unicode character?

Question

Recently I noticed that substring can return a string containing an invalid Unicode character.

For instance

public class Main {

    public static void main(String[] args) {
        String text = "🥦_Salade verte";

        /* We should avoid using endIndex = 1, as it will cause an invalid character in the returned substring. */
        // 1 : ?
        System.out.println("1 : " + text.substring(0, 1));

        // 2 : 🥦
        System.out.println("2 : " + text.substring(0, 2));

        // 3 : 🥦_
        System.out.println("3 : " + text.substring(0, 3));

        // 4 : 🥦_S
        System.out.println("4 : " + text.substring(0, 4));
    }
}

When trimming a long string with String.substring, what are some good ways to keep the returned substring from containing invalid Unicode?

I altered your code to use an underscore instead of the first SPACE, for clarity. See code run live at IdeOne.com.
– Basil Bourque, Commented Dec 1, 2021 at 3:29
@BasilBourque Thanks. I amended my sample code to make the result clearer.
– Cheok Yan Cheng, Commented Dec 1, 2021 at 3:37

Basil Bourque · Accepted Answer · 2021-12-02 07:58:14Z

16

`char` obsolete

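The char type has been legacy since Java 2 and is essentially broken. As a 16-bit value, a char cannot represent most of the characters Unicode defines; anything outside the Basic Multilingual Plane, such as 🥦, needs two char values (a surrogate pair).

Your discovery shows that the String#substring method is char-based: its indexes count char values, not characters. Hence the problem shown in your code.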
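To make the cause visible, here is a minimal sketch (the class name `SurrogateDemo` is only illustrative). It writes 🥦 as its two UTF-16 escapes so the surrogate pair can be inspected, and shows that `substring(0, 1)` keeps only the first half of that pair.

```java
public class SurrogateDemo {

    // U+1F966 (the broccoli emoji from the question), spelled out as its
    // UTF-16 surrogate pair, followed by "_Salade verte".
    public static final String TEXT = "\uD83E\uDD66_Salade verte";

    public static void main(String[] args) {
        // length() counts 16-bit char values, so the emoji counts twice.
        System.out.println(TEXT.length());                          // 15
        // codePointCount() counts whole code points.
        System.out.println(TEXT.codePointCount(0, TEXT.length())); // 14

        // The first two chars are the two halves of one surrogate pair.
        System.out.println(Character.isHighSurrogate(TEXT.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(TEXT.charAt(1)));  // true

        // substring(0, 1) keeps only the high surrogate, which is not a
        // valid character on its own.
        String broken = TEXT.substring(0, 1);
        System.out.println(Character.isSurrogate(broken.charAt(0)));   // true
    }
}
```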

Code point

Instead, use code point integer numbers when working with individual characters.

int[] codePoints = "🥦_Salade".codePoints().toArray() ;

[129382, 95, 83, 97, 108, 97, 100, 101]

Extract the first character’s code point.

int codePoint = codePoints[ 0 ] ;

129382

Make a single-character String object for that code point.

String firstCharacter = Character.toString( codePoint ) ;

🥦

You can grab a subset of that int array of code points.

int[] firstFewCodePoints = Arrays.copyOfRange( codePoints , 0 , 3 ) ;

And make a String object from those code points.

String s = 
    Arrays
        .stream( firstFewCodePoints ) 
        .collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
        .toString();

🥦_S

Or we can use a constructor of String to take a subset of the array.

String result = new String( codePoints , 0 , 3 ) ;

🥦_S

See this code run live at IdeOne.com.

edited Dec 2, 2021 at 7:58

answered Dec 1, 2021 at 3:39

Basil Bourque


2 Comments

Cheok Yan Cheng Over a year ago

Nice workaround. But is there a way to know what the to value for Arrays.copyOfRange should be, so that the resulting string's length does not exceed maxLength? The old way was string.substring(0, maxLength), but that may leave an invalid Unicode character at the end of the string.

Alex Veleshko Over a year ago

char is not obsolete and never will be. It is just that nobody ever promised that a single char corresponds to a single Unicode code point.

MC Emperor · Accepted Answer · 2021-12-08 09:26:11Z

6

The answer by Basil nicely shows that you should work with code points instead of chars.

A String does not expose Unicode code points directly; it is a sequence of UTF-16 char values. So there is no way to know which chars belong together to form one code point without inspecting the actual contents of the string.

Unicode-aware substring

Here is a Unicode-aware substring method. Since codePoints() returns an IntStream, we can utilize the skip and limit methods to extract a portion of the string.

public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
    int length = endIndex - beginIndex;
    int[] codePoints = string.codePoints()
        .skip(beginIndex)
        .limit(length)
        .toArray();
    return new String(codePoints, 0, codePoints.length);
}

Here is what happens in the snippet above. We stream over the Unicode code points, skip the first beginIndex code points, limit the stream to endIndex − beginIndex code points, and then convert it to an int[]. The resulting array contains the code points from beginIndex (inclusive) up to endIndex (exclusive).

Finally, the String class has a constructor that builds a String from an int[] of code points, so we use it to get the result.
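As a quick usage sketch, the method above can be wrapped in a small class and called with the string from the question. The class name `UnicodeSubstringDemo` is hypothetical, and the emoji is written as UTF-16 escapes only to keep the source file ASCII-safe.

```java
public class UnicodeSubstringDemo {

    // Same approach as the method above: indexes count code points, not chars.
    public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
        int[] codePoints = string.codePoints()
            .skip(beginIndex)
            .limit(endIndex - beginIndex)
            .toArray();
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // Broccoli emoji U+1F966 followed by "_Salade verte".
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(unicodeSubstring(text, 0, 1)); // the emoji alone
        System.out.println(unicodeSubstring(text, 0, 3)); // the emoji, then "_S"
        System.out.println(unicodeSubstring(text, 1, 7)); // "_Salad"
    }
}
```

Note that the indexes here differ from those of String.substring: index 1 is the underscore, because the emoji occupies a single code point position.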

Of course, you could tweak the method to be a little more strict by rejecting out-of-bounds values:

if (endIndex < beginIndex) {
    throw new IllegalArgumentException("endIndex < beginIndex");
}
int length = endIndex - beginIndex;
int[] codePoints = string.codePoints()
    .skip(beginIndex)
    .limit(length)
    .toArray();
if (codePoints.length < length) {
    throw new IllegalArgumentException(
        "begin %s, end %s, length %s".formatted(beginIndex, endIndex, codePoints.length)
    );
}
return new String(codePoints, 0, codePoints.length);

Online demo

edited Dec 8, 2021 at 9:26

answered Dec 2, 2021 at 9:26

MC Emperor


1 Comment

Rangi Keen Mar 7 at 20:21

Rather than manually checking codePoints.length in your method, you could just pass length into the String constructor, which will throw if it is out of bounds. See also stackoverflow.com/a/55674580/2506021 for an alternative implementation that looks like it will throw exceptions for out-of-bounds values without additional checks.
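One way to get bounds checking without a manual length test is String.offsetByCodePoints, which translates a code point offset into a char index and throws IndexOutOfBoundsException when the string has too few code points. The following is only a sketch in the spirit of the comment above, not a reproduction of the linked answer; the class name `OffsetSubstringDemo` is hypothetical.

```java
public class OffsetSubstringDemo {

    // Indexes count code points. Out-of-range values raise
    // IndexOutOfBoundsException from offsetByCodePoints or substring,
    // so no explicit length check is needed.
    public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
        int begin = string.offsetByCodePoints(0, beginIndex);
        int end = string.offsetByCodePoints(begin, endIndex - beginIndex);
        return string.substring(begin, end);
    }

    public static void main(String[] args) {
        // Broccoli emoji U+1F966 followed by "_Salade verte".
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(unicodeSubstring(text, 0, 3)); // the emoji, then "_S"

        try {
            unicodeSubstring(text, 0, 99); // more code points than the string has
        } catch (IndexOutOfBoundsException e) {
            System.out.println("out of bounds");
        }
    }
}
```

This variant also avoids building an intermediate int[], since it only computes two char indexes and delegates to substring.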

Cheok Yan Cheng · Accepted Answer · 2021-12-03 02:26:40Z

I would like to offer another point of view on how to implement a substring that guarantees the returned string contains only valid Unicode.

Unlike the answer provided by @MC Emperor, my code treats the length of 🥦 as 2 (instead of 1, as in @MC Emperor's answer).

This matters because it keeps the new function's behavior as close as possible to the old String.substring.

public static int length(String string) {
    if (string == null) {
        return 0;
    }
    return string.length();
}

public static String limitLength(String string, int maxLength) {
    int stringLength = length(string);

    if (stringLength <= maxLength) {
        return string;
    }

    List<Integer> codePointList = new ArrayList<>();

    for (int offset = 0; offset < maxLength; ) {
        final int codePoint = string.codePointAt(offset);

        final int charCount = Character.charCount(codePoint);

        if ((offset + charCount) > maxLength) {
            break;
        }

        codePointList.add(codePoint);

        offset += charCount;
    }

    int[] codePoints = new int[codePointList.size()];

    for (int i = 0; i < codePoints.length; i++) {
        codePoints[i] = codePointList.get(i);
    }

    String result = new String(codePoints, 0, Math.min(maxLength, codePoints.length));

    return result;
}

This may be inefficient when maxLength is large, but I can't think of a much better way right now. If you know a better way, feel free to amend the answer.
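One possible alternative, assuming the only goal is to avoid cutting a surrogate pair in half: look at the two chars around the cut position and back off by one char when the cut would separate a high surrogate from its low surrogate. This is a sketch rather than a drop-in replacement. The class name `LimitLengthDemo` is hypothetical, and unlike the code above it returns null unchanged for every maxLength.

```java
public class LimitLengthDemo {

    // Returns at most maxLength chars of string, never ending on the first
    // half of a surrogate pair. Lengths are counted in chars, as with
    // String.substring.
    public static String limitLength(String string, int maxLength) {
        if (string == null || string.length() <= maxLength) {
            return string;
        }
        if (maxLength <= 0) {
            return "";
        }
        int end = maxLength;
        // string.length() > maxLength here, so charAt(end) is in range.
        if (Character.isHighSurrogate(string.charAt(end - 1))
                && Character.isLowSurrogate(string.charAt(end))) {
            end--; // the cut would split a pair, so drop the dangling half
        }
        return string.substring(0, end);
    }

    public static void main(String[] args) {
        // Broccoli emoji U+1F966 followed by "_Salade verte".
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println("1 : " + limitLength(text, 1)); // empty: the emoji does not fit
        System.out.println("2 : " + limitLength(text, 2)); // the emoji
        System.out.println("3 : " + limitLength(text, 3)); // the emoji, then "_"
    }
}
```

Because only two chars are inspected, the cost of the check does not grow with maxLength.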

Alex Veleshko · Accepted Answer · 2021-12-19 10:11:39Z

0

It is by design. Java provides many ways to extract individual Unicode code points from a string when necessary: see the Oracle tutorial.

However, most of the time it's not needed, since you get the string index from a method like String.indexOf(String s) or Matcher.start(). In this case the resulting index won't point into the middle of a code point (as long as the argument s is a valid Unicode string).

It's even more common to work with regular expressions, where string indexes don't come up at all.
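As an illustration of the indexOf and Matcher.start() cases, here is a small sketch. The helper names `beforeFirst` and `fromFirstMatch` and the class name `IndexOfDemo` are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IndexOfDemo {

    // Returns the part of text before the first occurrence of separator,
    // or the whole text if separator does not occur.
    public static String beforeFirst(String text, String separator) {
        int index = text.indexOf(separator);
        return index < 0 ? text : text.substring(0, index);
    }

    // Returns the part of text starting at the first regex match,
    // or an empty string if there is no match.
    public static String fromFirstMatch(String text, String regex) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? text.substring(matcher.start()) : "";
    }

    public static void main(String[] args) {
        // Broccoli emoji U+1F966 followed by "_Salade verte".
        String text = "\uD83E\uDD66_Salade verte";

        // indexOf returns a char index (2 here) that falls after the
        // whole surrogate pair, so the cut is safe.
        System.out.println(text.indexOf("_"));             // 2
        System.out.println(beforeFirst(text, "_"));        // the emoji
        System.out.println(fromFirstMatch(text, "Salade")); // "Salade verte"
    }
}
```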

answered Dec 19, 2021 at 10:11

Alex Veleshko
