11

Using Python 3.4, I get the following error when trying to decode a bytes object as UTF-32:

Traceback (most recent call last):
  File "c:.\SharqBot.py", line 1130, in <module>
    fullR=s.recv(1024).decode('utf-32').split('\r\n')
UnicodeDecodeError: 'utf-32-le' codec can't decode bytes in position 0-3: codepoint not in range(0x110000)

and the following when trying to decode it as UTF-16:

  File "c:.\SharqBot.py", line 1128, in <module>
    fullR=s.recv(1024).decode('utf-16').split('\r\n')
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 374: truncated data

When I decode using UTF-8 there is no error. s is a socket connected to the Twitch IRC server irc.chat.twitch.tv on port 80.

It receives the following:

b':tmi.twitch.tv 001 absolutelyabot :Welcome, GLHF!\r\n:tmi.twitch.tv 002 absolutelyabot :Your host is tmi.twitch.tv\r\n:tmi.twitch.tv 003 absolutelyabot :This server is rather new\r\n:tmi.twitch.tv 004 absolutelyabot :-\r\n:tmi.twitch.tv 375 absolutelyabot :-\r\n:tmi.twitch.tv 372 absolutelyabot :You are in a maze of twisty passages, all alike.\r\n:tmi.twitch.tv 376 absolutelyabot :>\r\n'

Am I doing something wrong when trying to decode as UTF-16 and UTF-32? The reason I want to use UTF-32 is that occasionally someone sends a character that fails to decode as UTF-8, and I want to be able to receive that character instead of getting an error. Thanks for any help.

4
  • use decode('utf-8', errors='replace') for example. Commented Mar 21, 2016 at 19:44
  • I'm not trying to avoid the error altogether, I'm trying to receive the characters that aren't supported in UTF-8. Commented Mar 21, 2016 at 19:48
  • So you can try to decode the whole line using UTF-8. Only if an exception is thrown, try an alternative charset. I doubt the IRC protocol would ever allow UTF-16 or UTF-32, because of the embedded NULs. Commented Mar 21, 2016 at 19:56
  • "When I decode using utf-8 there is no error". So why do you think UTF-16 or UTF-32 should work?? Commented Oct 21, 2019 at 16:56
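
The fallback approach suggested in the comments above could be sketched roughly as follows. This is only an illustration: the helper name decode_irc_line, the choice of latin-1 as the fallback, and the sample PRIVMSG bytes are assumptions, not anything taken from the question's bot or from Twitch. The buffer is split on the raw b'\r\n' bytes first, so a single badly encoded line does not break the others.

```python
def decode_irc_line(raw, fallback="latin-1"):
    """Decode one IRC line, preferring UTF-8 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # latin-1 maps every byte to a character, so this never raises,
        # but the text is only right if the sender really used latin-1.
        return raw.decode(fallback)


# Made-up sample: the first line is UTF-8, the second is latin-1.
data = (b":nick PRIVMSG #chan :caf\xc3\xa9\r\n"
        b":nick PRIVMSG #chan :caf\xe9\r\n")

# Split the bytes into lines before decoding each one separately.
lines = [decode_irc_line(line) for line in data.split(b"\r\n") if line]
print(lines)
```

Note that recv(1024) can also return a buffer that ends in the middle of a line, so a real client would likely need to keep the trailing partial line and prepend it to the next read.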

3 Answers

21

Try decoding with ISO-8859-1 instead, e.g. s.recv(1024).decode('ISO-8859-1').


1 Comment

@CodeWarrior: Presumably the original text is latin-1 (the friendly name for ISO-8859-1) encoded, not utf-8. Or it isn't, but latin-1 is a one-to-one encoding where every byte maps to a character, so it's just masking errors and producing gibberish. Either way.
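
A small sketch of what this comment describes, using a made-up sample string: decoding with latin-1 never fails because all 256 byte values map to a character, but if the bytes were actually UTF-8, the result is mojibake rather than the original text.

```python
# Every possible byte value decodes under latin-1, so no error is ever raised.
all_bytes_text = bytes(range(256)).decode("latin-1")
print(len(all_bytes_text))

# UTF-8 bytes decoded as latin-1 come out as gibberish instead of failing.
utf8_bytes = "naïve".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# The original bytes are still recoverable, since the mapping is one-to-one.
print(mojibake.encode("latin-1").decode("utf-8"))
```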
3

Every Unicode ordinal can be represented in UTF-8. If decoding as UTF-8 isn't working, that's because the bytes being transmitted are in a different encoding, or the data is a mix of text and binary data and only some of it is UTF-8. Odds are the text is UTF-8 encoded (most network protocols are), so any non-UTF-8 data would be framing data or the like, and would need to be parsed out to extract the text.

Any attempt to mask such an error in the text/binary case would just be silencing problems, not fixing them. You need to know the encoding of the data (and the format, if it's not all text data with a single encoding), and use that. The data you receive doesn't magically become UTF-16 or UTF-32 because you want it to.
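
To illustrate both points, here is a minimal sketch using the first line of the server output quoted in the question. The helper try_decode is hypothetical, and the byte order is pinned to little-endian (matching the codec names in the tracebacks) so the result does not depend on the machine.

```python
# Any code point, including ones outside the BMP, round-trips through UTF-8.
for ch in ("A", "\u00e9", "\u2603", "\U0001F600"):
    assert ch.encode("utf-8").decode("utf-8") == ch

raw = b":tmi.twitch.tv 001 absolutelyabot :Welcome, GLHF!\r\n"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid (hypothetical helper)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The bytes are plain ASCII, which is valid UTF-8.
print(try_decode(raw, "utf-8"))

# As UTF-32, the first four bytes b":tmi" form one number far above 0x10FFFF.
print(try_decode(raw, "utf-32-le"))

# As UTF-16, bytes are consumed in pairs; this 51-byte line leaves one over.
print(try_decode(raw, "utf-16-le"))

# With an even length UTF-16 "works", but yields unrelated characters.
print(try_decode(b"hi", "utf-16-le"))
```

The last case is worth noting: a UTF-16 decode that raises no error is not evidence that the data was UTF-16.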

1 Comment

IRC does not specify text encoding.
0

You can try decode/encode with 'utf-16-le'. I tried it and it worked for me, but I'm not really clear why. :P

1 Comment

Please try to be more clear with your answer and explain why this worked for you. Perhaps describe what is different between your approach and the OP
