This question is related to "How to decode bytes in GB18030 correctly".

I would like to decode a byte array that is encoded in GBK, but I found that Python and Java sometimes behave differently.

ch = b'\xA6\xDA'
print(ch.decode('gbk'))

It raises an error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 0: illegal multibyte sequence

Java is able to decode it.

import java.nio.charset.Charset;

// The same two bytes that Python's gbk codec rejects
byte[] data = {(byte) 0xA6, (byte) 0xDA};
String s = new String(data, Charset.forName("GBK"));
System.out.println(s);

It seems that Python and Java adopt different implementations for GBK, right?

  • Does it work if you replace gbk (gb16500) with gb18030? Commented May 12, 2023 at 4:10
  • @Corralien Yes. Both Java and Python work if I use GB18030. Commented May 12, 2023 at 4:14
  • Related thread - Moreover, look at CP936, which is aliased to gbk in Python (check result of b'\xa6\xda'.decode('cp936')), which may be a source of confusion. On top of that, quoting wikipedia, "most modern-day Windows-based software products mean partial support for GBK via Windows-936 when they use the term "GB 2312" as a character encoding option", so my original hunch that Python maps GBK to GB 2312 may be correct. Commented May 12, 2023 at 4:21
  • That all being said, /bin/echo -e '\xa6\xda' | iconv -f gbk -t utf8 also reports an illegal input sequence at position 0, likewise if gb2312 or gb13000 is substituted for gbk - thus I am pretty sure a6 is out of range of those earlier standards and only gb18030 actually supports a6, and gbk in Java may in fact refer to the later standard. Commented May 12, 2023 at 4:29
  • If you really want to use gbk encoding, you can use errors='ignore' for characters outside the valid range: print(ch.decode('gbk', errors='ignore')) Commented May 12, 2023 at 4:37

1 Answer

It seems that Python and Java adopt different implementations for GBK, right?

Yes. GBK is an ambiguous encoding to some extent, so different platforms may implement it differently.

As for Python (CPython here), the mapping from GBK to Unicode is defined in mappings_cn.h, which is a strict implementation of CP936 in which some code points in the non-Chinese regions (such as 0xA6DA) are not defined.

In contrast, GBK in Java (OpenJDK 17 here) is in fact an extended CP936 that includes some extra non-Chinese characters in order to follow GB18030/MS936. For example, 0xA6DA is mapped to Unicode U+E78E (although that code point lies in the Private Use Area).
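The difference can be checked from the Python side with a small sketch like the one below. It assumes CPython's built-in codecs; the helper name try_decode is hypothetical and exists only for this illustration. The sketch tries the same two bytes with several codec names and prints the resulting code points instead of the characters, since a Private Use Area character usually has no visible glyph.

```python
data = b"\xa6\xda"  # the byte pair from the question


def try_decode(raw: bytes, encoding: str):
    """Return the decoded text, or None if the codec rejects the bytes.

    Hypothetical helper for this illustration, not part of any library.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# 'cp936' is an alias of 'gbk' in CPython, so both use the same mapping.
for name in ("gbk", "cp936", "gb18030"):
    text = try_decode(data, name)
    if text is None:
        print(f"{name}: cannot decode")
    else:
        # Show code points rather than glyphs.
        print(f"{name}: " + " ".join(f"U+{ord(c):04X}" for c in text))
```

If the input may contain such code points, decoding with gb18030 is probably the safer choice in Python, because errors='ignore' silently discards the bytes that the gbk codec does not recognize.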
