I have a byte array that carries strings encoded in UCS-2LE. Normally the null terminator of a UCS-2LE string is encoded as two null bytes (00 00), but sometimes there is only one, as below:

import java.nio.charset.Charset;
import java.util.Arrays;

class Ucs {
    public static void main(String[] args) {
        byte[] b = new byte[] {87, 0, 105, 0, 110, 0, 0}; 
        String s = new String(b, Charset.forName("UTF-16LE"));
        System.out.println(Arrays.toString(s.getBytes()));
        System.out.println(s);
    }   
}

This program outputs:

[87, 105, 110, -17, -65, -67]
Win�

I don't know why the string's internal byte array grows, or where the unknown Unicode character comes from. How can I eliminate it?

  • getBytes() uses the user's default Java character encoding, which is unknown to us and probably unknown to you, too. Try dumping with a known, useful character encoding for Unicode such as UTF-16 or UTF-8. Commented Nov 9, 2017 at 2:39
  • "sometimes there's only one": Can you prevent the problem upstream? Commented Nov 9, 2017 at 2:43
  • If you don't like the replacement character (�) quietly indicating the data corruption, you can configure a character decoder that throws an exception instead. Commented Nov 9, 2017 at 2:46
  • @TomBlodget Thanks for the tip. Upstream is out of my control and wasted my time! Commented Nov 9, 2017 at 5:24
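
The two suggestions in the comments above can be sketched roughly as follows. This is only an illustration, not the commenters' own code: the class name `StrictDecode` and the helper `decodeStrict` are hypothetical. The sketch dumps the decoded string with an explicit charset instead of the platform default, and uses a `CharsetDecoder` configured with `CodingErrorAction.REPORT`, so the dangling byte raises an exception instead of silently becoming U+FFFD.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class StrictDecode {
    // Hypothetical helper: decodes UTF-16LE and throws on malformed input
    // instead of substituting the replacement character.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_16LE.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        byte[] b = {87, 0, 105, 0, 110, 0, 0};

        // Lenient decoding: the lone trailing byte becomes U+FFFD.
        String s = new String(b, StandardCharsets.UTF_16LE);
        // Dump with an explicit charset; U+FFFD is FD FF in UTF-16LE.
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE)));
        // [87, 0, 105, 0, 110, 0, -3, -1]

        try {
            decodeStrict(b);
        } catch (CharacterCodingException e) {
            // The odd trailing byte is reported as malformed input.
            System.out.println("Malformed input: " + e);
        }
    }
}
```

In the original program the extra bytes -17, -65, -67 are EF BF BD, the UTF-8 encoding of U+FFFD, which suggests the platform default charset there was UTF-8.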

2 Answers

Would a hack to ignore a final odd-length byte help?

int bytesToUse = b.length%2 == 0 ? b.length : b.length - 1;
String s = new String(b, 0, bytesToUse, Charset.forName("UTF-16LE"));
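
Building on the snippet above, a slightly fuller sketch might also cut the decoded text at the first NUL character, since a well-formed `00 00` terminator would otherwise survive decoding as a trailing `\u0000`. The names `Ucs2Strings` and `decodeTerminated` are hypothetical, and the sketch assumes that everything after the first terminator can be discarded.

```java
import java.nio.charset.StandardCharsets;

class Ucs2Strings {
    // Hypothetical helper: decode UTF-16LE, ignore a dangling odd byte,
    // and stop at the first NUL terminator if one is present.
    static String decodeTerminated(byte[] b) {
        // Round the length down to an even number of bytes.
        int bytesToUse = b.length - (b.length % 2);
        String s = new String(b, 0, bytesToUse, StandardCharsets.UTF_16LE);
        // Assumption: nothing after the first terminator is wanted.
        int nul = s.indexOf('\0');
        return nul >= 0 ? s.substring(0, nul) : s;
    }

    public static void main(String[] args) {
        // Truncated terminator (single 00) and complete terminator (00 00).
        System.out.println(decodeTerminated(new byte[] {87, 0, 105, 0, 110, 0, 0}));
        System.out.println(decodeTerminated(new byte[] {87, 0, 105, 0, 110, 0, 0, 0}));
    }
}
```

Both calls in `main` should print `Win`, with no replacement character and no trailing NUL.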

1 Comment

Yep, it's a way :)
Use an InputStreamReader with either the proper Charset or a custom CharsetDecoder.

// Option 1: plain charset. A dangling odd byte is still replaced with U+FFFD.
Reader reader = new InputStreamReader(
   new ByteArrayInputStream(new byte[]{87, 0, 105, 0, 110, 0, 0}),
   Charset.forName("UTF-16LE"));

// Option 2: custom decoder. Skeleton only: decodeLoop still needs a real implementation.
Reader customReader = new InputStreamReader(
   new ByteArrayInputStream(new byte[]{87, 0, 105, 0, 110, 0, 0}),
   new CharsetDecoder(Charset.forName("UTF-16LE"), 0.5f, 1.0f) {
      @Override
      protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
        // detect trailing zero(s) to skip them
        // maybe employ the first version to do the actual conversion
        return CoderResult.UNDERFLOW; // placeholder so the skeleton compiles
      }
   });
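
As an alternative to subclassing, a simpler sketch is to hand `InputStreamReader` the stock UTF-16LE decoder configured with `CodingErrorAction.IGNORE`, so that malformed input such as a lone trailing byte is dropped rather than replaced. This swaps in a different technique from the custom `decodeLoop` outlined above. The class name `LenientReader` and the method `readAll` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientReader {
    // Hypothetical helper: reads all characters, silently skipping
    // malformed input such as a dangling odd byte at the end.
    static String readAll(byte[] bytes) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_16LE.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), decoder)) {
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new byte[] {87, 0, 105, 0, 110, 0, 0}));
    }
}
```

Note that `IGNORE` also hides any other malformed sequences, for example unpaired surrogates, and a complete `00 00` terminator would still come through as a `\u0000` character, so this may be too permissive when the corruption needs to be detected.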
