I read some data from a stream in UTF-8 encoding:

String line = new String(byteArray, "UTF-8");

then try to find a substring:

int startPos = line.indexOf(tag) + tag.length();
int endPos   = line.indexOf("/", startPos);

and cut it out:

String name = line.substring(startPos, endPos);

In most cases it works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc. It seems that surrogate pairs are randomly broken for some reason. This happens about 4 times out of 1000.

How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to the result?

  • Is this a problem on Linux? Where are you viewing the "broken" lines? I had the same problem in an SWT Table, but when I sent the same string to an SWT Text or Label it displayed correctly. Most likely it is a display issue. Commented Oct 11, 2013 at 12:04
  • It's true that the indexOf and substring methods work on UTF-16 code units (char values) rather than code points, so they can potentially break up surrogate pairs, but гордунни has no surrogate pairs! Are you sure the text was read correctly to begin with? Commented Oct 11, 2013 at 14:02
  • Does it produce the same result if you add -Dfile.encoding=UTF-8 to the command line? Commented Oct 11, 2013 at 14:09
  • The fact that this happens so rarely suggests there could be a bug in the buffer-handling code somewhere. Can you reproduce the problem reliably using a longer string, maybe with 10,000 characters? Which version of Java are you using? Commented Oct 12, 2013 at 8:06
  • @Joni Thank you, you were right. The cause of the problem was in my stream-handling code. For a large InputStream I read it in small chunks and convert each byte array into a String separately, concatenating the Strings later if required. So the bytes of a multi-byte character could be split across two arrays, and the later concatenation produces these "broken" strings. If I convert the whole InputStream into one String, the problem disappears. I still have no idea how to do it with small chunks, but I found the cause of the "broken" strings. Thanks. Commented Oct 14, 2013 at 11:01

2 Answers

1

The problem occurs because the stream was read in chunks of bytes and each chunk was decoded separately, so a chunk boundary sometimes fell in the middle of a multi-byte UTF-8 character.
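The effect can be reproduced in a few lines. This is only a sketch of what the asker's stream code presumably did, not the actual code: the class `SplitChunkDemo`, the helper `decodeInChunks`, and the chunk sizes are made up for illustration. Each Cyrillic letter takes two bytes in UTF-8, so an odd chunk size cuts some letters in half, and each half decodes to the replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SplitChunkDemo {
    // The sample name from the question, written with Unicode escapes
    // so the source file stays ASCII-only.
    static final String NAME = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";

    // Hypothetical helper: decodes every chunk on its own and concatenates
    // the results, which is the pattern that loses split characters.
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int end = Math.min(off + chunkSize, data.length);
            byte[] chunk = Arrays.copyOfRange(data, off, end);
            sb.append(new String(chunk, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = NAME.getBytes(StandardCharsets.UTF_8); // 16 bytes, 2 per letter

        // Chunk size 4: every boundary falls between two letters, so nothing breaks.
        System.out.println(decodeInChunks(data, 4).equals(NAME));

        // Chunk size 5: boundaries fall inside letters, so U+FFFD appears.
        System.out.println(decodeInChunks(data, 5).indexOf('\uFFFD') >= 0);
    }
}
```

Whether a given name breaks depends only on where the chunk boundaries happen to land, which would explain why the failure showed up in only a few cases out of a thousand.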

By wrapping the InputStream in an InputStreamReader created with the UTF-8 charset, you read chunks of characters (as opposed to chunks of bytes), and multi-byte UTF-8 characters survive, because the reader carries an incomplete byte sequence over to the next read.
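A minimal sketch of that approach follows. The class `ReaderChunkDemo`, the method `readAll`, and the buffer sizes are illustrative names and values, not anything from the question; the in-memory stream stands in for whatever stream is actually being read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReaderChunkDemo {
    // Reads the stream in small chunks of chars rather than bytes.
    // The reader does the UTF-8 decoding, so no byte sequence is cut in half.
    static String readAll(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // append only the chars actually read
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The sample name from the question, written with Unicode escapes.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = name.getBytes(StandardCharsets.UTF_8);

        // Even an awkward chunk size of 3 chars reproduces the text exactly.
        String result = readAll(new ByteArrayInputStream(data), 3);
        System.out.println(result.equals(name)); // prints "true"
    }
}
```

A char chunk can still end between the two halves of a surrogate pair, but that is harmless as long as the chunks are appended to one StringBuilder before any searching or cutting is done.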

0

In your example, can you show the contents of byteArray, line, and tag? Can you also show the resulting length, startPos, and endPos? After all, the string "гордунни" contains no "/". Why do you calculate endPos, and what string is in tag? Are you sure substring's second parameter is the end position and not the length? It is true that "гордунни" needs no surrogate pairs, because all of its code points are below 0xFFFF. But once your UTF-16 string contains at least one surrogate pair, I bet the length of the string will give you the number of 16-bit code units and not the number of code points. I am not sure about Java, but in C# Length gives you the number of code units; to get the number of characters/code points you have to use the StringInfo class. Also check whether there is a BOM in your string. What is


String line = new String(byteArray, "UTF-8");

doing? Is the byte array a UTF-8 encoded string that gets transformed to UTF-16? Does it contain a UTF-8 BOM? Does the string afterwards have a UTF-16LE or UTF-16BE BOM?
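For the Java side of these questions, a short sketch may help. The class `CodeUnitDemo` and its helper names are invented for illustration. It shows that `String.length()` counts UTF-16 code units while `codePointCount` counts code points, and that decoding UTF-8 bytes that start with a BOM leaves U+FEFF as the first char of the resulting String.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // Number of code points, as opposed to length(), which counts code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // True if the decoded string still starts with the BOM character U+FEFF.
    static boolean startsWithBom(String s) {
        return !s.isEmpty() && s.charAt(0) == '\uFEFF';
    }

    public static void main(String[] args) {
        // The sample name from the question: BMP characters only.
        String name = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        System.out.println(name.length());    // prints 8
        System.out.println(codePoints(name)); // prints 8

        // 'a', U+1F600 (stored as a surrogate pair), 'b'.
        String astral = "a\uD83D\uDE00b";
        System.out.println(astral.length());    // prints 4
        System.out.println(codePoints(astral)); // prints 3

        // UTF-8 BOM (EF BB BF) followed by 'a': the BOM is not stripped.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(startsWithBom(decoded)); // prints "true"
    }
}
```

Since indexOf and substring both use code-unit indices, they stay consistent with each other; counting code points matters only when an index has to correspond to a user-visible character position.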

1 Comment

Regarding the questions about BOMs: the resulting String is UTF-16 internally and will start with a BOM character (U+FEFF) only if the UTF-8 input had a BOM, which the Unicode standard permits but neither requires nor recommends for UTF-8.
