Java substring broken encoding

Question

I read some data from a stream and decode it as UTF-8:

String line = new String(byteArray, "UTF-8");

Then I look for a subsequence:

int startPos = line.indexOf(tag) + tag.length();
int endPos   = line.indexOf("/", startPos);

and cut it out:

String name = line.substring(startPos, endPos);

In most cases this works fine, but sometimes the result is broken. For example, for an input name like "гордунни" I get values like "горд��нни", "горду��ни", "г��рдунни", etc. It looks as if surrogate pairs are being broken at random. This happened 4 times out of 1000.

How can I fix it? Do I need to use other String methods instead of indexOf() + substring(), or apply some encoding/decoding magic to the result?

Does this problem occur on Linux? Where are you viewing the "broken" lines? I had the same problem in an SWT Table, but when I put the same string in an SWT Text or Label it displayed correctly. Most likely it is a display issue.
– Nicolai, Commented Oct 11, 2013 at 12:04
It's true that the indexOf and substring methods work on char values (UTF-16 code units) rather than code points, so they can potentially break up surrogate pairs, but гордунни has no surrogate pairs! Are you sure the text was read correctly to begin with?
– Joni, Commented Oct 11, 2013 at 14:02
Does it produce the same result if you add -Dfile.encoding=UTF-8 to the command line?
– Alcanzar, Commented Oct 11, 2013 at 14:09
The fact that this happens so rarely suggests there could be a bug in the buffer handling code somewhere. Can you reproduce the problem reliably using a longer string, maybe with 10,000 characters? Which version of Java are you using?
– Joni, Commented Oct 12, 2013 at 8:06
@Joni Thank you, you were right. The cause of the problem was in my stream handling code. I read a large InputStream in small chunks and convert each byte array into a String separately, then concatenate the Strings if required. So a multi-byte character could be split across two arrays, and the later concatenation produces these "broken" strings. If I convert the whole InputStream into one String, the problem disappears. I still have no idea how to do it with small chunks, but I found the cause of the "broken" strings. Thanks
– n00bot, Commented Oct 14, 2013 at 11:01

andrel · Accepted Answer · 2020-06-02 12:32:08Z

1

The problem occurs because the stream was read as chunks of bytes, sometimes splitting multi-byte UTF-8 characters.
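The effect can be reproduced in isolation. The following is a minimal sketch, not the asker's actual code: the class `SplitDemo` and the helper `decodeInChunks` are hypothetical names, and the chunk sizes are chosen only for illustration. Each Cyrillic letter in the sample takes two bytes in UTF-8, so an odd chunk size cuts some letters in half, and each half decodes to the replacement character U+FFFD.

```java
import java.nio.charset.StandardCharsets;

public class SplitDemo {
    // Decodes every chunk separately and concatenates the results.
    // This mirrors the faulty approach described in the question's comments.
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "гордунни" written with Unicode escapes so the source file encoding does not matter.
        String original = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438";
        byte[] data = original.getBytes(StandardCharsets.UTF_8); // 16 bytes, 2 per letter

        // Chunk boundaries fall between letters: the text survives.
        System.out.println(decodeInChunks(data, 4).equals(original));
        // Chunk boundaries fall inside letters: U+FFFD appears in the result.
        System.out.println(decodeInChunks(data, 3).contains("\uFFFD"));
    }
}
```

In a real stream the chunk boundary only occasionally lands inside a character, which would explain why the corruption shows up in a small fraction of cases.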

By wrapping the InputStream in an InputStreamReader, you will read chunks of characters (as opposed to chunks of bytes), and multi-byte UTF-8 characters will survive.
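A minimal sketch of this approach, assuming the whole stream is UTF-8 text. The class `ReaderDemo`, the helper `readAll`, and the buffer size are hypothetical and chosen for illustration; `InputStreamReader` keeps any incomplete byte sequence internally until the remaining bytes arrive, so the caller never sees half a character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderDemo {
    // Reads the stream in small chunks of chars instead of bytes.
    static String readAll(InputStream in, int bufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[bufferSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n); // only fully decoded chars reach this point
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438"; // "гордунни"
        byte[] data = original.getBytes(StandardCharsets.UTF_8);

        // A 3-char buffer still reconstructs the text exactly.
        String result = readAll(new ByteArrayInputStream(data), 3);
        System.out.println(result.equals(original));
    }
}
```

If the chunks must stay as byte arrays for some reason, `java.nio.charset.CharsetDecoder` can also decode incrementally while carrying leftover bytes from one chunk to the next, at the cost of more bookkeeping.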

edited Jun 2, 2020 at 12:32

answered Jul 15, 2014 at 6:59

andrel


brighty · Answer · 2014-01-22 18:10:48Z

0

In your example, can you show the contents of byteArray, line, and tag? Can you also show the resulting length, startPos, and endPos? After all, the string "гордунни" contains no "/". And why do you calculate endPos? What is the string inside tag? Are you sure substring's second parameter is the end position and not the length?

It is true that "гордунни" needs no surrogate pairs, because all of its code points are below 0xFFFF, but once your UTF-16 string contains at least one surrogate pair, I bet the string's length will give you the number of 16-bit code units and not the number of code points. I am not sure about Java, but in C# Length gives you the number of elements; to get the number of characters/code points you have to use the StringInfo class. Also check whether your string contains a BOM. What is

String line = new String(byteArray, "UTF-8");

doing? Is the byte array a UTF-8 encoded string that is converted to UTF-16? Does it contain a UTF-8 BOM? Does the string afterwards have a UTF-16LE or UTF-16BE BOM?
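On the length question: in Java, `String.length()` counts UTF-16 code units, and `codePointCount` counts code points. The sketch below illustrates the difference; the class `LengthDemo` and its helpers `codePoints` and `prefixByCodePoints` are hypothetical names introduced here, and the emoji is an arbitrary supplementary character used only as an example.

```java
public class LengthDemo {
    // Number of Unicode code points, as opposed to length(), which counts UTF-16 code units.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Takes the first n code points without cutting a surrogate pair in half.
    static String prefixByCodePoints(String s, int n) {
        return s.substring(0, s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String bmp = "\u0433\u043e\u0440\u0434\u0443\u043d\u043d\u0438"; // "гордунни", no surrogates
        String withEmoji = "a\uD83D\uDE00b";                             // 'a', U+1F600, 'b'

        System.out.println(bmp.length() == codePoints(bmp)); // same count for BMP-only text
        System.out.println(withEmoji.length());              // code units
        System.out.println(codePoints(withEmoji));           // code points

        // A plain substring(0, 2) ends on the high surrogate, leaving half a character.
        System.out.println(Character.isHighSurrogate(withEmoji.substring(0, 2).charAt(1)));
        // The code-point-aware prefix keeps the pair intact.
        System.out.println(prefixByCodePoints(withEmoji, 2).equals("a\uD83D\uDE00"));
    }
}
```

For the string in the question this distinction makes no difference, since every letter is a single code unit, which is consistent with the cause lying in the byte handling rather than in indexOf() or substring().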

answered Jan 22, 2014 at 18:10

brighty


joel_s Over a year ago

Regarding the BOM questions: the string will be encoded as UTF-16 and will contain a BOM only if the UTF-8 input had one (the Unicode standard permits a BOM in UTF-8 but neither requires nor recommends it).
