Why do Python and Java behave differently when decoding GBK

Question

This question is related to "How to decode bytes in GB18030 correctly".

I would like to decode an array of bytes encoded in GBK, but I found that Python and Java sometimes behave differently.

ch = b'\xA6\xDA'
print(ch.decode('gbk'))

It raises an error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 0: illegal multibyte sequence

Java is able to decode it.

import java.nio.charset.Charset;

// Decode the same two bytes with Java's GBK charset
byte[] data = {(byte) 0xA6, (byte) 0xDA};
String s = new String(data, Charset.forName("GBK"));
System.out.println(s);

It seems that Python and Java adopt different implementations for GBK, right?

Related thread - Moreover, look at CP936, which is aliased to gbk in Python (check the result of b'\xa6\xda'.decode('cp936')), which may be a source of confusion. On top of that, quoting Wikipedia, "most modern-day Windows-based software products mean partial support for GBK via Windows-936 when they use the term "GB 2312" as a character encoding option", so my original hunch that Python maps GBK to GB 2312 may be correct.
– metatoaster, Commented May 12, 2023 at 4:21
That all being said, /bin/echo -e '\xa6\xda' | iconv -f gbk -t utf8 also reports an illegal input sequence at position 0, and likewise if gb2312 or gb13000 is substituted for gbk. Thus I am pretty sure a6 is out of range of those earlier standards and only gb18030 actually supports a6, and gbk in Java may in fact refer to the later standard.
– metatoaster, Commented May 12, 2023 at 4:29
If you really want to use gbk encoding, you can use errors='ignore' for characters outside the valid range: print(ch.decode('gbk', errors='ignore'))
– Corralien, Commented May 12, 2023 at 4:37
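
To see what the errors= suggestion above actually does, here is a minimal sketch that decodes the same two bytes with several of Python's standard error handlers. The helper name decode_with_handlers is made up for illustration. Note that 'ignore' drops undecodable bytes silently, so 'replace' or 'backslashreplace' may be safer when you need to notice the loss.

```python
RAW = b"\xA6\xDA"  # the byte pair from the question


def decode_with_handlers(data: bytes, encoding: str = "gbk") -> dict:
    """Decode `data` once per error handler and collect the results.

    Hypothetical helper for illustration; the handler names themselves
    are standard Python codec error handlers.
    """
    handlers = ("ignore", "replace", "backslashreplace")
    return {h: data.decode(encoding, errors=h) for h in handlers}


if __name__ == "__main__":
    # No exception is raised; each handler deals with the bad bytes differently.
    for handler, text in decode_with_handlers(RAW).items():
        print(f"{handler:17} -> {text!r}")
```

Bytes that CPython's gbk codec does define are unaffected by the handler, so this only changes how the undefined sequences are treated.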

chenzhongpu · Accepted Answer · 2023-05-13 08:01:52Z


It seems that Python and Java adopt different implementations for GBK, right?

Yes. GBK is an ambiguous encoding to some extent, so different platforms may implement it differently.

As for Python (CPython here), the mapping from GBK to Unicode is defined in mappings_cn.h, which is a strict implementation of CP936 in which some characters outside the Chinese-character regions (such as 0xA6DA) are not defined.

In contrast, GBK in Java (OpenJDK 17 here) is in fact an extended CP936 that includes some extra non-Chinese characters in order to follow GB18030/MS936. For example, 0xA6DA is mapped to U+E78E (although that code point lies in the Private Use Area).
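
One way to check the Python side of this yourself is to run the same bytes through several related codecs and compare the outcomes. The sketch below assumes only the standard codec names gbk, cp936, gb2312 and gb18030; the helper name try_decode is made up for illustration. What gb18030 returns depends on the mapping tables shipped with your Python build, so no expected output is shown.

```python
def try_decode(data: bytes, codecs=("gbk", "cp936", "gb2312", "gb18030")) -> dict:
    """Try each codec and record either the decoded code points or the error.

    Hypothetical helper for illustration only.
    """
    results = {}
    for name in codecs:
        try:
            text = data.decode(name)
            # Show code points rather than glyphs, since Private Use Area
            # characters usually have no visible rendering.
            results[name] = " ".join(f"U+{ord(c):04X}" for c in text)
        except UnicodeDecodeError as exc:
            results[name] = f"error: {exc.reason}"
    return results


if __name__ == "__main__":
    for name, outcome in try_decode(b"\xA6\xDA").items():
        print(f"{name:8} -> {outcome}")
```

In CPython, cp936 is an alias for gbk, so those two rows should always agree.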

edited May 13, 2023 at 8:01

answered May 13, 2023 at 7:55
