5

How do I convert an int array holding a UTF-8 string into a StringBuilder in a while loop? For example:
int array: 71, 73, 70, 56, 57, 97, 149, 0, 55, 0, 247...
resulting string: GIF89a• €÷€ € €€ÀÜÀ¦Êð*?ª*?ÿ...
The string contains Latin, Cyrillic and Asian characters, as well as various symbols and numbers.

    do {
        buffer.append((char) num[++i]);
    } while ((byte) buffer.charAt(buffer.length() - 1) != -1);

This approach garbles all non-Latin characters.

2
  • Could you show the data for the entire buffer? Commented Jun 7, 2012 at 20:30
  • +1 for getting weird symbols in the question.. :) Commented Jun 7, 2012 at 20:34

2 Answers

3

First of all, convert the int[] to a byte[] as follows:

    //intArray contains your data...
    byte[] utf8bytes = new byte[intArray.length];
    for(int i = 0; i < intArray.length; i++)
    {
        utf8bytes[i] = (byte) intArray[i];
    }

Then create a string from your bytes specifying UTF-8 as the encoding:

    String asString = new String(utf8bytes, "UTF-8");
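The two steps above can be combined into one small program. This is only a sketch, assuming that every element of the array is a single unsigned byte (0-255) of UTF-8 text; the class name `Utf8IntDecoder` and the sample data are made up for illustration. It uses the `StandardCharsets.UTF_8` constant (Java 7+) instead of the `"UTF-8"` string, which avoids the checked `UnsupportedEncodingException`.

```java
import java.nio.charset.StandardCharsets;

final class Utf8IntDecoder {
    // Assumption: every element is one unsigned byte (0-255) of UTF-8 text.
    static String decode(int[] intArray) {
        byte[] utf8bytes = new byte[intArray.length];
        for (int i = 0; i < intArray.length; i++) {
            utf8bytes[i] = (byte) intArray[i]; // keeps only the low 8 bits
        }
        // Decode the whole byte sequence at once so multi-byte characters stay intact
        return new String(utf8bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample: "Hi " followed by three Cyrillic letters as UTF-8 bytes
        int[] intArray = {72, 105, 32, 208, 159, 209, 128, 208, 184};
        StringBuilder buffer = new StringBuilder();
        buffer.append(decode(intArray));
        System.out.println(buffer);
    }
}
```

Decoding the full array in one call matters here: casting each element to `char` and appending it separately treats every byte of a multi-byte UTF-8 sequence as its own character, which is why non-Latin text comes out garbled.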

3 Comments

Does each int contain 1 byte instead of 4?
From your (admittedly small) selection of example values, it looked like you were dealing with an array of ints < 256, which can therefore safely be cast to bytes. If you did have 4 bytes packed into each int, most of the values would have very large absolute values. In that case you could unpack them into separate bytes using bit masks and logical shifts.
I tried utf8bytes[0] = (byte)(intArray[i] >>> 24); utf8bytes[1] = (byte)(intArray[i] >>> 16); utf8bytes[2] = (byte)(intArray[i] >>> 8); utf8bytes[3] = (byte)intArray[i]; but this adds 3 space characters after each Latin character and 2 space characters after each Cyrillic character.
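The shift-based unpacking discussed in these comments can be sketched as follows. It assumes each int packs four bytes with the most significant byte first; the class name `PackedIntUnpacker` and the sample values are hypothetical. Unlike the snippet in the comment, each int writes to its own four-byte slot (`i * 4`) rather than always to indices 0 to 3.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PackedIntUnpacker {
    // Assumption: each int packs four bytes, most significant byte first.
    static byte[] unpack(int[] intArray) {
        byte[] bytes = new byte[intArray.length * 4];
        for (int i = 0; i < intArray.length; i++) {
            int base = i * 4; // each int gets its own 4-byte slot
            bytes[base]     = (byte) (intArray[i] >>> 24);
            bytes[base + 1] = (byte) (intArray[i] >>> 16);
            bytes[base + 2] = (byte) (intArray[i] >>> 8);
            bytes[base + 3] = (byte) intArray[i];
        }
        return bytes;
    }

    public static void main(String[] args) {
        // 0x47494638 packs the bytes for 'G', 'I', 'F', '8'
        int[] packed = {0x47494638};
        System.out.println(new String(unpack(packed), StandardCharsets.UTF_8));
        // A small value such as 71 unpacks to three zero bytes followed by 71
        System.out.println(Arrays.toString(unpack(new int[] {71})));
    }
}
```

Note that any value below 256 unpacks to three zero bytes plus one data byte, and a value below 65536 to two zero bytes plus two data bytes. Those zero bytes decode to U+0000 characters, which would be consistent with the padding described in the last comment and may indicate that the ints are not packed bytes at all.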
0

You are reading a GIF89a file as one integer per byte and then printing it as if it were a text string. The main problem is that the integers (bytes) in that file do not map to meaningful text characters, so wherever a byte does not correspond to a letter, it is rendered as whatever your text encoding dictates (which looks to me like a lot of garbage).

Graphical information does not always map cleanly to text. While there are 256 possible byte values, and sometimes one or more bytes represent a single character, there are only 26 letters in the English alphabet, each in upper and lower case. Along with the ten digits and a handful of punctuation marks, you get about 80 different characters in common use in an essay. The remaining 160+ values are control codes, signals that a multi-byte sequence follows, or mappings to characters that support the display of other languages.

That garbage is the closest available byte-to-character mapping for your current character set. If you want better output, try reading a file whose data actually represents characters.
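To illustrate how the same bytes produce different text depending on the character set, here is a minimal sketch that decodes the sample values from the question twice. The class name `CharsetComparison` is made up, and the choice of ISO-8859-1 as the second charset is only an assumption for comparison; the thread does not say which encoding the asker's environment uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetComparison {
    // Assumption: each int is one unsigned byte value (0-255).
    static String decode(int[] values, Charset charset) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // The first values from the question (the start of a GIF file)
        int[] sample = {71, 73, 70, 56, 57, 97, 149, 0, 55, 0, 247};
        String asUtf8 = decode(sample, StandardCharsets.UTF_8);
        String asLatin1 = decode(sample, StandardCharsets.ISO_8859_1);
        // Byte 149 is not valid on its own in UTF-8, so it becomes U+FFFD
        System.out.printf("UTF-8:      U+%04X%n", (int) asUtf8.charAt(6));
        // ISO-8859-1 maps every byte to the code point with the same value
        System.out.printf("ISO-8859-1: U+%04X%n", (int) asLatin1.charAt(6));
    }
}
```

The first six bytes are plain ASCII and decode to "GIF89a" under both charsets; only the bytes above 127 differ, because they have no standalone meaning as UTF-8 text.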

1 Comment

No, this is just an example; the program is not designed for reading files. The program will work with text messages in Russian and Asian languages.
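If the array holds text messages rather than file bytes, another possibility is that each int is a whole Unicode code point instead of a UTF-8 byte. The thread never confirms how the array is filled, so this is purely an assumption, as is the use of -1 as an end marker (borrowed from the loop condition in the question). Under those assumptions, a while loop with `StringBuilder.appendCodePoint` might look like this; the class name `CodePointAppender` is hypothetical.

```java
final class CodePointAppender {
    // Assumptions: each int is one Unicode code point, and -1 marks the end of the text.
    static String build(int[] num) {
        StringBuilder buffer = new StringBuilder();
        int i = 0;
        while (i < num.length && num[i] != -1) {
            // appendCodePoint also handles code points above U+FFFF (surrogate pairs)
            buffer.appendCodePoint(num[i]);
            i++;
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        // Hypothetical sample: Latin, Cyrillic and CJK code points, then the end marker
        int[] num = {72, 105, 32, 0x41F, 0x440, 0x438, 0x4E2D, -1};
        System.out.println(build(num));
    }
}
```

`appendCodePoint` throws an `IllegalArgumentException` for values that are not valid code points, so this variant fails loudly rather than silently producing garbage if the assumption is wrong.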
