
I have code which computes the SHA-256 hash of a String, and noticed that I was getting different hashes from Android and Oracle Java 7 for the same string. My hashing code converts the String into byte[] with:

byte[] data = stringData.getBytes("UTF-16");

With UTF-16 encoding, I get different results from Oracle Java and Android Java. This is the string I was hashing:

// Test Code:
String toHash = "testdata";
System.out.println("Hash: " + DataHash.getHashString(toHash));

And got these hashes with UTF-16:

Hash: a1112a0363a59097a701e38398e1fdfef3049358aee81b77ecaad2924a426bc5 [Oracle Java 7]
Hash: 811b723aee07c7a52456fc57a5683e73649075a373d341f7257bf73575111ba3 [Android 2.2]

However, with UTF-8, I get the same hash with both JREs:

Hash: 810ff2fb242a5dee4220f2cb0e6a519891fb67f2f828a6cab4ef8894633b1f50 [Oracle Java 7]
Hash: 810ff2fb242a5dee4220f2cb0e6a519891fb67f2f828a6cab4ef8894633b1f50 [Android 2.2]

Is there some kind of endian-ness issue going on which is causing the different results on the different platforms? How should I really be preparing a String to be hashed in a platform independent way?

EDIT: Whoops, the answer is rather obvious once you read about UTF-16 a bit more. There are two versions of UTF-16 (big-endian and little-endian). You just need to specify which version getBytes() should use, and the hashes are the same. Pick one of:

  • UTF-16LE
  • UTF-16BE
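The fix above can be sketched as a small hashing helper (`DataHash.getHashString` here is a hypothetical reconstruction of the method named in the test code, not the original implementation). Encoding with an explicit byte order, UTF-16BE in this sketch, makes every JRE produce the same bytes and therefore the same digest:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DataHash {

    // Encode with an explicit byte order so every JRE produces identical
    // bytes (UTF-16BE writes big-endian with no BOM), then hash those bytes.
    public static String getHashString(String input) throws NoSuchAlgorithmException {
        byte[] data = input.getBytes(StandardCharsets.UTF_16BE);
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);

        // Convert the raw digest bytes to lowercase hex.
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println("Hash: " + getHashString("testdata"));
    }
}
```

The same approach works with `StandardCharsets.UTF_16LE`; the only requirement is that both platforms agree on one of the two.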

2 Answers


According to the Oracle Java documentation:

When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

That means plain UTF-16 should always encode as Big Endian in Oracle Java.

Then from Android Java documentation:

Charset            Encoder writes
UTF-16BE           BE, no BOM
UTF-16LE           LE, no BOM
UTF-16             BE, with BE BOM

So there is a bug either in one of the implementations or in the documentation. Both should encode plain UTF-16 as big-endian and write a big-endian BOM, so there shouldn't be any difference.

In general you should prefer UTF-16BE/LE over UTF-16, but in this case it seems to be a bug.
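A quick way to see what a given JRE actually emits is to print the encoded bytes directly. This sketch shows the three charsets side by side; per the documentation quoted above, plain UTF-16 should start with the big-endian BOM bytes 0xFE 0xFF (-2, -1 as signed bytes), while the explicit variants write no BOM:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16Bytes {
    public static void main(String[] args) {
        String s = "testdata";

        // Plain UTF-16: the Oracle docs say big-endian with a BE BOM (0xFE 0xFF).
        System.out.println("UTF-16:   " + Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));

        // The explicit variants write no BOM at all.
        System.out.println("UTF-16BE: " + Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```

Running this on each platform makes the byte-order and BOM discrepancy visible immediately, which is exactly what the comments below did.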


3 Comments

Ahh, interesting. It does look like Android (2.2 at least) is doing little-endian conversion: Oracle Java 7: UTF-16: [-2, -1, 0, 116, 0, 101, 0, 115, 0, 116, 0, 100, 0, 97, 0, 116, 0, 97] Android Java 2.2: UTF-16: [-1, -2, 116, 0, 101, 0, 115, 0, 116, 0, 100, 0, 97, 0, 116, 0, 97, 0]
@TajMorton -1, -2, 116, 0.. is Little Endian, with LE BOM. Is that from Android? Anyway, it clearly contradicts with Android documentation.
Sorry, my formatting got destroyed and I accidentally posted before I was ready. Oracle Java 7 gave [-2, -1, 0, 116] with "UTF-16", whereas Android 2.2 gave [-2, -1, 116, 0]. So yes, it does look like it's producing LE with a LE BOM.

Show your hashing code; it is probably doing something wrong. The result of hashing is a byte[], so there is no need to round-trip it through a String in the first place. For converting a binary hash value to a String, use Base64 or hex encoding.
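A minimal sketch of that suggestion: hash the bytes once, then render the raw digest either as Base64 or as hex (the hex form matches the output shown in the question for UTF-8 input):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public class HashEncoding {
    public static void main(String[] args) throws NoSuchAlgorithmException {
        // Hash the UTF-8 bytes of the test string once.
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest("testdata".getBytes(StandardCharsets.UTF_8));

        // Base64: a compact, printable representation of the raw digest bytes.
        System.out.println("Base64: " + Base64.getEncoder().encodeToString(digest));

        // Hex: the representation used in the question's output.
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("Hex:    " + hex);
    }
}
```

Either encoding is deterministic across platforms because it operates on the digest bytes themselves, never on a re-decoded String.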

