
As a sample I have the following string, which I presume to be UTF-16 encoded: "hühühüh".

In Python I get the following result when encoding:

>>> base64.b64encode("hühühüh".encode("utf-16"))
b'//5oAPwAaAD8AGgA/ABoAA=='

In Java:

>>> String test = "hühühüh";
>>> byte[] encodedBytes = Base64.getEncoder().encode(test.getBytes(StandardCharsets.UTF_16));
>>> String testBase64Encoded = new String(encodedBytes, StandardCharsets.US_ASCII);
>>> System.out.println(testBase64Encoded);
/v8AaAD8AGgA/ABoAPwAaA==

In JavaScript I define a binary conversion function as per the Mozilla dev guideline and then encode the same string.

>> function toBinary(string) {
      // Uint16Array stores the code units in the platform's native byte order
      // (little-endian on virtually all machines) and writes no BOM.
      const codeUnits = new Uint16Array(string.length);
      for (let i = 0; i < codeUnits.length; i++) {
          codeUnits[i] = string.charCodeAt(i);
      }
      return String.fromCharCode(...new Uint8Array(codeUnits.buffer));
  }
>> btoa(toBinary("hühühüh"))

aAD8AGgA/ABoAPwAaAA=

As you can see, each encoder produced a distinct base64 string. So let's reverse the encoding again.

In Python all the generated strings decode fine again:

>>> base64.b64decode("//5oAPwAaAD8AGgA/ABoAA==").decode("utf-16")
'hühühüh'
>>> base64.b64decode("/v8AaAD8AGgA/ABoAPwAaA==").decode("utf-16")
'hühühüh'
>>> base64.b64decode("aAD8AGgA/ABoAPwAaAA=").decode("utf-16")
'hühühüh'

In JavaScript, using the fromBinary function, again as per the Mozilla dev guideline:

>>> function fromBinary(binary) {
  // Reassemble the bytes and reinterpret them as 16-bit code units,
  // again in the platform's (little-endian) byte order.
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  console.log(...bytes);
  return String.fromCharCode(...new Uint16Array(bytes.buffer));
}
>>> fromBinary(window.atob("//5oAPwAaAD8AGgA/ABoAA=="))
"\ufeffhühühüh"
>>> fromBinary(window.atob("/v8AaAD8AGgA/ABoAPwAaA=="))
"\ufffe栀ﰀ栀ﰀ栀ﰀ栀"
>>> fromBinary(window.atob("aAD8AGgA/ABoAPwAaAA="))
"hühühüh"

And finally in Java:

>>> String base64Encoded = "//5oAPwAaAD8AGgA/ABoAA==";
>>> byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
>>> String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
>>> System.out.println(base64Decoded);
hühühüh
>>> String base64Encoded = "/v8AaAD8AGgA/ABoAPwAaA==";
>>> byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
>>> String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
>>> System.out.println(base64Decoded);
hühühüh
>>> String base64Encoded = "aAD8AGgA/ABoAPwAaAA=";
>>> byte[] asBytes = Base64.getDecoder().decode(base64Encoded);
>>> String base64Decoded = new String(asBytes, StandardCharsets.UTF_16);
>>> System.out.println("Decoded" + base64Decoded);
hühühüh

We can see that Python is able to decode the strings produced by all three encoders. But the Java and JavaScript results do not seem to be compatible with each other, and I do not understand why. Is this a problem with the Base64 libraries in Java and JavaScript, and if so, are there other tools or routes that let us pass Base64-encoded UTF-16 strings between a Java and a JavaScript application? How can I ensure safe Base64 string transport between Java and JavaScript applications using tools as close to core language functionality as possible?

EDIT: As said in the accepted answer, the problem is different UTF-16 encodings. The compatibility problem between Java and JavaScript can be solved either by generating the UTF-16 bytes in JavaScript in reversed (big-endian) order, or by decoding the received bytes on the Java side with StandardCharsets.UTF_16LE.
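
For reference, a minimal sketch of the second option in the same REPL style as above; the hard-coded Base64 string is simply the one produced by the JavaScript encoder earlier:

>>> import java.util.Base64;
>>> import java.nio.charset.StandardCharsets;
>>> // The JavaScript bytes carry no BOM, and Java's UTF_16 charset assumes
>>> // big-endian when no BOM is present, so decode explicitly as little-endian.
>>> byte[] raw = Base64.getDecoder().decode("aAD8AGgA/ABoAPwAaAA=");
>>> String decoded = new String(raw, StandardCharsets.UTF_16LE);
>>> System.out.println(decoded);
hühühüh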

1 Answer

The problem is that there are 4 variants of UTF-16.

This character encoding uses two bytes per code unit. Which of the two bytes should come first? This creates two variants:

  • UTF-16BE stores the most significant byte first.
  • UTF-16LE stores the least significant byte first.

To allow telling the difference between these two, there is an optional "byte order mark" (BOM) character, U+FEFF, at the start of the text. So UTF-16BE with BOM starts with the bytes fe ff, while UTF-16LE with BOM starts with ff fe. Since the BOM is optional, allowing it to be present or absent doubles the number of possible encodings.

It looks like you are using 3 of the 4 possible encodings:

  • Python used UTF-16LE with BOM
  • Java used UTF-16BE with BOM
  • JavaScript used UTF-16LE without BOM
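
To make this concrete, here is a small Java sketch (assuming Java 8+; the little-endian BOM is prepended by hand, since Java's UTF_16LE charset never writes one) that reproduces all three Base64 strings from the question:

>>> import java.util.Base64;
>>> import java.nio.charset.StandardCharsets;
>>> String s = "hühühüh";
>>> // Java's UTF_16 charset writes a big-endian BOM (fe ff) when encoding.
>>> System.out.println(Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_16)));
/v8AaAD8AGgA/ABoAPwAaA==
>>> // UTF_16LE puts the least significant byte first and writes no BOM,
>>> // which matches the JavaScript output.
>>> byte[] le = s.getBytes(StandardCharsets.UTF_16LE);
>>> System.out.println(Base64.getEncoder().encodeToString(le));
aAD8AGgA/ABoAPwAaAA=
>>> // Python's utf-16 codec writes a little-endian BOM (ff fe) first.
>>> byte[] withBom = new byte[le.length + 2];
>>> withBom[0] = (byte) 0xFF;
>>> withBom[1] = (byte) 0xFE;
>>> System.arraycopy(le, 0, withBom, 2, le.length);
>>> System.out.println(Base64.getEncoder().encodeToString(withBom));
//5oAPwAaAD8AGgA/ABoAA==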

One of the reasons why people prefer UTF-8 to UTF-16 is to avoid this confusion.


2 Comments

Thanks for the pointers, but I think your analysis is wrong. If you look at the encoded strings, the Python string starts with \xff\xfe, which indicates little-endian, while the Java string starts with \xfe\xff, which indicates big-endian. So on the Java side we need to specify StandardCharsets.UTF_16LE for the non-BOM'ed JavaScript-encoded string. Both Python and JavaScript use little-endian, which is why the conversion works there.
You're right, Python and JS were little-endian, Java was big-endian. Fixed
