The answer by Basil nicely shows that you should work with code points instead of chars.
A String stores UTF-16 code units (chars), not Unicode code points, so there is no way to know which chars pair up to form a single code point without inspecting the actual contents of the string.
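To make the mismatch concrete, here is a small sketch comparing the char-based and code-point-based views of a string that contains one emoji. The class name CodePointDemo and the sample string are made up for illustration.

```java
public class CodePointDemo {
    // "a😀b": U+1F600 lies outside the Basic Multilingual Plane,
    // so it is stored as two chars (a surrogate pair).
    static final String SAMPLE = "a\uD83D\uDE00b";

    public static void main(String[] args) {
        // Number of chars (UTF-16 code units): prints 4
        System.out.println(SAMPLE.length());
        // Number of code points: prints 3
        System.out.println(SAMPLE.codePointCount(0, SAMPLE.length()));
        // substring uses char indexes, so this cuts the surrogate pair in half
        String broken = SAMPLE.substring(0, 2);
        // The last char is a lone high surrogate: prints true
        System.out.println(Character.isHighSurrogate(broken.charAt(1)));
    }
}
```

The char-based substring ends in half of a surrogate pair, which is not a valid character on its own.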
Unicode-aware substring
Here is a Unicode-aware substring method. Since codePoints() returns an IntStream, we can utilize the skip and limit methods to extract a portion of the string.
public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
    int length = endIndex - beginIndex;
    int[] codePoints = string.codePoints()
        .skip(beginIndex)
        .limit(length)
        .toArray();
    return new String(codePoints, 0, codePoints.length);
}
Here is what happens in the snippet above. We stream over the string's code points, skip the first beginIndex code points, limit the stream to endIndex − beginIndex code points, and collect the result into an int[]. The array then contains the code points from beginIndex (inclusive) up to endIndex (exclusive).
Finally, the String class has a convenient constructor that builds a String from an int[] of code points, so we use it to get the result.
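As a quick usage sketch, the following compares the method with the char-based String.substring on a string containing an emoji. The class name UnicodeSubstringDemo and the sample text are assumptions made for the example; the method body is the one shown above.

```java
public class UnicodeSubstringDemo {
    // Same method as above: the indexes count code points, not chars.
    public static String unicodeSubstring(String string, int beginIndex, int endIndex) {
        int length = endIndex - beginIndex;
        int[] codePoints = string.codePoints()
            .skip(beginIndex)
            .limit(length)
            .toArray();
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00bc"; // "a😀bc"
        // Code point indexes 1..3: the emoji followed by "b"
        System.out.println(unicodeSubstring(text, 1, 3));
        // Char indexes 1..3 cover only the two chars of the emoji: prints true
        System.out.println(text.substring(1, 3).equals("\uD83D\uDE00"));
        // This lenient version silently truncates when endIndex is too large: prints "bc"
        System.out.println(unicodeSubstring("abc", 1, 10));
    }
}
```

Note that the same index pair selects different text in the two calls, because one counts code points and the other counts chars.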
Of course, you could tweak the method to be a little more strict by rejecting out-of-bounds values:
if (endIndex < beginIndex) {
    throw new IllegalArgumentException("endIndex < beginIndex");
}
int length = endIndex - beginIndex;
int[] codePoints = string.codePoints()
    .skip(beginIndex)
    .limit(length)
    .toArray();
if (codePoints.length < length) {
    throw new IllegalArgumentException(
        "begin %s, end %s, length %s".formatted(beginIndex, endIndex, codePoints.length)
    );
}
return new String(codePoints, 0, codePoints.length);
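For completeness, here is a sketch of the strict variant wrapped in a full method, with a call that triggers the check. The names StrictSubstringDemo and strictUnicodeSubstring are hypothetical, and String.format stands in for formatted(), which needs Java 15 or later.

```java
public class StrictSubstringDemo {
    // Strict variant: rejects ranges that extend past the end of the string.
    public static String strictUnicodeSubstring(String string, int beginIndex, int endIndex) {
        if (endIndex < beginIndex) {
            throw new IllegalArgumentException("endIndex < beginIndex");
        }
        int length = endIndex - beginIndex;
        int[] codePoints = string.codePoints()
            .skip(beginIndex) // skip itself throws IllegalArgumentException for a negative argument
            .limit(length)
            .toArray();
        if (codePoints.length < length) {
            // Fewer code points than requested: the range ran off the end
            throw new IllegalArgumentException(String.format(
                "begin %s, end %s, length %s", beginIndex, endIndex, codePoints.length));
        }
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        System.out.println(strictUnicodeSubstring("abc", 1, 3)); // prints "bc"
        try {
            strictUnicodeSubstring("abc", 1, 10);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One quirk of this check: the length in the message is the number of code points actually collected after skipping, not the total length of the string.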