
I only recently noticed that String.substring can return a string containing an invalid Unicode character.

For instance:

public class Main {

    public static void main(String[] args) {
        String text = "🥦_Salade verte";

        /* We should avoid using endIndex = 1, as it will cause an invalid character in the returned substring. */
        // 1 : ?
        System.out.println("1 : " + text.substring(0, 1));

        // 2 : 🥦
        System.out.println("2 : " + text.substring(0, 2));

        // 3 : 🥦_
        System.out.println("3 : " + text.substring(0, 3));

        // 4 : 🥦_S
        System.out.println("4 : " + text.substring(0, 4));
    }
}

When trimming a long string with String.substring, what are some good ways to prevent the returned substring from containing invalid Unicode?

  • I altered your code to use an underscore instead of the first SPACE, for clarity. See code run live at IdeOne.com. Commented Dec 1, 2021 at 3:29
  • @BasilBourque Thanks. I amended my sample code to make the result clearer. Commented Dec 1, 2021 at 3:37

4 Answers


char obsolete

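The char type has been legacy since Java 2 and is essentially broken. As a 16-bit value, a single char is physically incapable of representing most characters; any character outside the Basic Multilingual Plane, such as 🥦, takes two char values (a surrogate pair).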

The String#substring method is char-based: its indexes count char values, not characters. Hence the problem shown in your code, where endIndex = 1 cuts the surrogate pair for 🥦 in half.
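
A minimal sketch to check this with nothing but the standard library. The class name SurrogateCheck and the helper splitsSurrogatePair are made up for illustration, and the emoji is written as the escape \uD83E\uDD66 so that the file compiles regardless of source encoding.

```java
class SurrogateCheck {

    // Hypothetical helper: true if cutting the string at `index` would
    // separate a high surrogate from the low surrogate that follows it.
    public static boolean splitsSurrogatePair(String s, int index) {
        return index > 0
                && index < s.length()
                && Character.isHighSurrogate(s.charAt(index - 1))
                && Character.isLowSurrogate(s.charAt(index));
    }

    public static void main(String[] args) {
        // "\uD83E\uDD66" is 🥦 (U+1F966) written as its two UTF-16 code units.
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(text.length());                         // 15 chars
        System.out.println(text.codePointCount(0, text.length())); // 14 code points
        System.out.println(splitsSurrogatePair(text, 1));          // true
        System.out.println(splitsSurrogatePair(text, 2));          // false
    }
}
```

The difference between the two counts is exactly the number of surrogate pairs in the string, which is why index 1 is the only unsafe cut point in this example.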

Code point

Instead, use code point integer numbers when working with individual characters.

int[] codePoints = "🥦_Salade".codePoints().toArray() ;

[129382, 95, 83, 97, 108, 97, 100, 101]

Extract the first character’s code point.

int codePoint = codePoints[ 0 ] ;

129382

Make a single-character String object for that code point.

String firstCharacter = Character.toString( codePoint ) ; 

🥦

You can grab a subset of that int array of code points.

int[] firstFewCodePoints = Arrays.copyOfRange( codePoints , 0 , 3 ) ;

And make a String object from those code points.

String s = 
    Arrays
        .stream( firstFewCodePoints ) 
        .collect( StringBuilder::new , StringBuilder::appendCodePoint , StringBuilder::append )
        .toString();

🥦_S

Or we can use a constructor of String to take a subset of the array.

String result = new String( codePoints , 0 , 3 ) ;

🥦_S

See this code run live at IdeOne.com.


2 Comments

Nice workaround. But is there a way to know what the to value for Arrays.copyOfRange should be, so that the length of the resulting string does not exceed maxLength? The old way is string.substring(0, maxLength), but that may leave an invalid Unicode character at the end of the string.
char is not obsolete and never will be. It's just that no one ever promised that a single char would correspond to a single Unicode code point.

The answer by Basil nicely shows that you should work with code points instead of chars.

A String does not store Unicode code points internally, so there is no way to know which char values belong together to form a single code point without inspecting the actual contents of the string.

Unicode-aware substring

Here is a Unicode-aware substring method. Since codePoints() returns an IntStream, we can utilize the skip and limit methods to extract a portion of the string.

public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
    int length = endIndex - beginIndex;
    int[] codePoints = string.codePoints()
        .skip(beginIndex)
        .limit(length)
        .toArray();
    return new String(codePoints, 0, codePoints.length);
}

This is what happens in the snippet above. We stream over the Unicode code points, skip the first beginIndex code points, limit the stream to endIndex − beginIndex code points, and then convert it to an int[]. The result is an int array containing the code points from beginIndex (inclusive) up to endIndex (exclusive).

Finally, the String class has a convenient constructor that builds a String from an int[] of code points, so we use it to get the String.
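
A quick check of the method, wrapped in a hypothetical class UnicodeSubstringDemo (the class name and the sample calls are only for illustration). The indexes here count code points, so 🥦 occupies a single position.

```java
class UnicodeSubstringDemo {

    // Same method as above: beginIndex and endIndex count code points, not chars.
    public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
        int length = endIndex - beginIndex;
        int[] codePoints = string.codePoints()
            .skip(beginIndex)
            .limit(length)
            .toArray();
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // "\uD83E\uDD66" is 🥦 (U+1F966) written as its two UTF-16 code units.
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(unicodeSubstring(text, 0, 1));   // 🥦
        System.out.println(unicodeSubstring(text, 0, 3));   // 🥦_S
        System.out.println(unicodeSubstring(text, 1, 3));   // _S

        // limit() simply stops at the end of the stream, so an endIndex
        // past the end is tolerated instead of being rejected.
        System.out.println(unicodeSubstring("abc", 0, 10)); // abc
    }
}
```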


Of course, you could tweak the method to be a little more strict by rejecting out-of-bounds values:

if (endIndex < beginIndex) {
    throw new IllegalArgumentException("endIndex < beginIndex");
}
int length = endIndex - beginIndex;
int[] codePoints = string.codePoints()
    .skip(beginIndex)
    .limit(length)
    .toArray();
if (codePoints.length < length) {
    throw new IllegalArgumentException(
        "begin %s, end %s, length %s".formatted(beginIndex, endIndex, codePoints.length)
    );
}
return new String(codePoints, 0, codePoints.length);

Online demo
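
Another way to get strict behaviour, sketched here with String.offsetByCodePoints, which converts a count of code points into a char index and throws IndexOutOfBoundsException when the string does not contain that many code points. The class name OffsetSubstringDemo and the method name codePointSubstring are hypothetical.

```java
class OffsetSubstringDemo {

    // Hypothetical variant: beginIndex and endIndex count code points.
    // Out-of-range arguments surface as IndexOutOfBoundsException, thrown
    // either by offsetByCodePoints or by substring itself.
    public static String codePointSubstring(String s, int beginIndex, int endIndex) {
        int begin = s.offsetByCodePoints(0, beginIndex);
        int end = s.offsetByCodePoints(begin, endIndex - beginIndex);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // "\uD83E\uDD66" is 🥦 (U+1F966) written as its two UTF-16 code units.
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(codePointSubstring(text, 0, 3)); // 🥦_S
        System.out.println(codePointSubstring(text, 1, 3)); // _S

        try {
            codePointSubstring(text, 0, 100);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("rejected: endIndex is past the last code point");
        }
    }
}
```

This version avoids building an intermediate int[] and reuses the original substring once the char offsets are known.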

1 Comment

Rather than manually checking codepoints.length in your method, you could just pass length in to the String constructor which will throw if it is out of bounds. See also stackoverflow.com/a/55674580/2506021 for an alternative implementation that looks like it will throw exceptions for out of bounds values without additional checks.

I would like to offer another way to implement a substring that guarantees the returned string contains only valid Unicode.

Unlike the answer by @MC Emperor, my code treats the length of 🥦 as 2 (instead of 1, as in @MC Emperor's answer).

This matters because it keeps the new function's behavior as close as possible to that of the old String.substring.

public static int length(String string) {
    if (string == null) {
        return 0;
    }
    return string.length();
}

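// Requires: import java.util.ArrayList; import java.util.List;
public static String limitLength(String string, int maxLength) {
    int stringLength = length(string);

    // Nothing to trim (a null string counts as length 0).
    if (stringLength <= maxLength) {
        return string;
    }

    List<Integer> codePointList = new ArrayList<>();

    // Walk the string one code point at a time; offset counts chars, not code points.
    for (int offset = 0; offset < maxLength; ) {
        final int codePoint = string.codePointAt(offset);

        // 1 for a BMP character, 2 for a surrogate pair.
        final int charCount = Character.charCount(codePoint);

        // Stop before a code point that would push the result past maxLength chars.
        if ((offset + charCount) > maxLength) {
            break;
        }

        codePointList.add(codePoint);

        offset += charCount;
    }

    // Unbox the collected code points into the int[] that the String constructor expects.
    int[] codePoints = new int[codePointList.size()];

    for (int i = 0; i < codePoints.length; i++) {
        codePoints[i] = codePointList.get(i);
    }

    String result = new String(codePoints, 0, Math.min(maxLength, codePoints.length));

    return result;
}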

This code may be inefficient when maxLength is large, since it copies every code point into a list, but I can't think of a much better way right now. If you know a better way, feel free to amend the answer.
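
One possible simplification, offered as a sketch rather than a drop-in replacement: only the last code point can be cut in half, so it may be enough to look at the two chars on either side of the cut and step back by one when they form a surrogate pair. The class name TruncateDemo and the method name truncateByChars are hypothetical, and rejecting a negative maxLength is an added assumption.

```java
class TruncateDemo {

    // Hypothetical alternative: maxLength counts chars, as in String.substring.
    public static String truncateByChars(String s, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength < 0");
        }
        if (s == null || s.length() <= maxLength) {
            return s;
        }
        int end = maxLength;
        // Here s.length() > maxLength, so charAt(end) is always in range.
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--; // the cut would split a surrogate pair, so drop its first half too
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // "\uD83E\uDD66" is 🥦 (U+1F966) written as its two UTF-16 code units.
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(truncateByChars(text, 1).isEmpty()); // true
        System.out.println(truncateByChars(text, 2));           // 🥦
        System.out.println(truncateByChars(text, 3));           // 🥦_
    }
}
```

The work done no longer depends on maxLength, apart from the copy that substring itself makes.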


It is by design. Java provides many ways to extract individual Unicode code points from a string if it's necessary: see the Oracle tutorial.

However, most of the time it's not needed since you get the string index from a method like String.indexOf(String s) or Matcher.start(). In this case the resulting index won't point in the middle of a code point (as long as the argument s is a valid Unicode string).
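
A small illustration of that point, assuming a made-up helper named before in a hypothetical class IndexOfDemo: the index comes from indexOf, so the cut lands on a code point boundary without any extra checks.

```java
class IndexOfDemo {

    // Hypothetical helper: everything before the first occurrence of
    // delimiter, or the whole text when the delimiter is absent.
    public static String before(String text, String delimiter) {
        int index = text.indexOf(delimiter);
        return index < 0 ? text : text.substring(0, index);
    }

    public static void main(String[] args) {
        // "\uD83E\uDD66" is 🥦 (U+1F966) written as its two UTF-16 code units.
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(text.indexOf("_")); // 2, just past the surrogate pair
        System.out.println(before(text, "_")); // 🥦
    }
}
```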

It's even more common to work with regular expressions, where string indexes don't come up at all.
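
As a rough sketch of the regular expression route: in java.util.regex the dot matches a whole code point, including one stored as a surrogate pair, so a bounded quantifier can take a prefix without any index arithmetic. The class name RegexPrefixDemo and the method name firstCodePoints are hypothetical, and count is assumed to be non-negative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexPrefixDemo {

    // Hypothetical helper: at most `count` code points from the start of text.
    // DOTALL lets the dot match line terminators as well.
    public static String firstCodePoints(String text, int count) {
        Pattern prefix = Pattern.compile("^.{0," + count + "}", Pattern.DOTALL);
        Matcher matcher = prefix.matcher(text);
        // The pattern can match the empty string, so find() succeeds here.
        return matcher.find() ? matcher.group() : "";
    }

    public static void main(String[] args) {
        // "\uD83E\uDD66" is 🥦 (U+1F966) written as its two UTF-16 code units.
        String text = "\uD83E\uDD66_Salade verte";

        System.out.println(firstCodePoints(text, 1)); // 🥦
        System.out.println(firstCodePoints(text, 3)); // 🥦_S
    }
}
```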
