This question is related to "How to decode bytes in GB18030 correctly".
I would like to decode a byte array that is encoded in GBK, but I found that Python and Java sometimes behave differently. For example, in Python:
ch = b'\xA6\xDA'
print(ch.decode('gbk'))
It raises an error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 0: illegal multibyte sequence
Java is able to decode it.
import java.nio.charset.Charset;

// Inside a method, e.g. main:
byte[] data = {(byte) 0xA6, (byte) 0xDA};
String s = new String(data, Charset.forName("GBK"));
System.out.println(s);
It seems that Python and Java adopt different implementations for GBK, right?
In Python, cp936 is an alias of gbk (check the result of b'\xa6\xda'.decode('cp936')), which may be a source of confusion. On top of that, quoting Wikipedia, "most modern-day Windows-based software products mean partial support for GBK via Windows-936 when they use the term "GB 2312" as a character encoding option", so my original hunch that Python maps GBK to GB 2312 may be correct.

The following command also reports "illegal input sequence at position 0":
/bin/echo -e '\xa6\xda' | iconv -f gbk -t utf8
The same happens if gb2312 or gb13000 is used in place of gbk. Thus I am pretty sure that the sequence a6 da is outside the range of those earlier standards, that only gb18030 actually supports it, and that gbk in Java may in fact refer to the later standard.

If you want to stay with the gbk encoding, you can use errors='ignore' to skip characters outside the valid range:
print(ch.decode('gbk', errors='ignore'))
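
The behaviour described above can be checked directly from Python. The following is only a sketch: `try_decode` and `decode_with_fallback` are hypothetical helper names, not part of any library, and the fallback order (gbk first, then gb18030) is an assumption about what you might want. Note also that the character Python's gb18030 codec produces for these bytes may not match what Java prints.

```python
def try_decode(data, encodings):
    """Return {encoding: decoded text, or None if strict decoding fails}."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


def decode_with_fallback(data, encodings=("gbk", "gb18030")):
    """Decode with the first codec that accepts the bytes (assumed order)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the codecs could decode the input")


if __name__ == "__main__":
    ch = b"\xa6\xda"
    # Compare the strict result of several related Chinese codecs.
    results = try_decode(ch, ["gbk", "cp936", "gb2312", "gb18030"])
    for name, text in results.items():
        print(name, "failed" if text is None else ascii(text))

    # The lead byte 0xA6 is accepted by gbk when followed by another
    # trail byte, so the failure is specific to the pair A6 DA.
    print(ascii(b"\xa6\xa1".decode("gbk")))

    # Fall back to gb18030 only when gbk rejects the input.
    print(decode_with_fallback(ch))
```

Unlike errors='ignore', which silently drops the bytes it cannot decode, falling back to a superset codec keeps the data, at the cost of possibly getting a different code point than another implementation would.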