I read some data from a stream in UTF-8 encoding:
String line = new String(byteArray, "UTF-8");
then try to find a subsequence:
int startPos = line.indexOf(tag) + tag.length();
int endPos = line.indexOf("/", startPos);
and cut it out:
String name = line.substring(startPos, endPos);
In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc.
It seems like surrogate pairs are randomly broken for some reason. This happened 4 times out of 1000.
How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to the result?
The indexOf and substring methods work on UTF-16 code units (char values), not code points, so they could potentially split a surrogate pair, but гордунни has no surrogate pairs! Are you sure the text was read correctly to begin with?
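To illustrate what "not read correctly" could look like: the reading code isn't shown in the question, so this is only a guess, but one common cause is decoding each buffer of bytes separately. Every Cyrillic letter takes two bytes in UTF-8, and if a buffer boundary falls between those two bytes, each half is decoded as malformed input and replaced with U+FFFD, which would give exactly the "��" pattern above. A minimal sketch, where the class name, the "name=" tag and the chunk size of 8 are all made up for the demonstration:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8ChunkDemo {

    // Decodes every fixed-size chunk on its own, the way a read(buffer) loop
    // followed by new String(buffer, "UTF-8") would (assumed, not from the question).
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            // A 2-byte character cut by the chunk boundary is malformed in both chunks.
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Lets a Reader do the decoding; it keeps an incomplete byte sequence
    // until the remaining bytes arrive, so no character is split.
    static String decodeWithReader(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The name from the question, written with Unicode escapes so the
        // encoding of this source file cannot interfere with the demo.
        String original = "name=\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438/";
        byte[] data = original.getBytes(StandardCharsets.UTF_8);

        String perChunk = decodePerChunk(data, 8);
        String viaReader = decodeWithReader(data, 8);

        System.out.println("per chunk contains U+FFFD: " + (perChunk.indexOf('\uFFFD') >= 0));
        System.out.println("reader matches original:   " + viaReader.equals(original));
    }
}
```

If this is what is happening, indexOf() and substring() are not the problem and no re-encoding of the result will help, because the damage is done before the String exists. Either wrap the stream in an InputStreamReader (as in decodeWithReader), or collect all the bytes of a line first and decode them in one go.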