Why do b(string) and bytes(string, 'utf-8') give different results in Python?

Question

byte_1 = b'\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8'
byte_2 = bytes('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8', 'utf-8')
byte_3 = ('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8').encode()

Here byte_1 == byte_2 is False but byte_2 == byte_3 is True. Can someone help me understand why that is the case?

1 is already a byte string. 2 and 3 are Unicode strings being encoded to byte strings.
– Mark Tolonen, Commented Feb 12, 2021 at 14:32

Mark Tolonen · Accepted Answer · 2021-02-12 21:10:31Z

2

byte_1 is already a byte string. byte_2 and byte_3 are Unicode strings being encoded to byte strings.

'\xc0' is an escape code representing a Unicode code point, U+00C0. b'\xc0' is an escape code representing a byte value of 0xC0.

In UTF-8, Unicode code points above U+007F are encoded in two or more bytes, so '\xc0'.encode() returns the two bytes b'\xc3\x80'.
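The difference between the two literals can be checked directly. This is a minimal sketch; the variable names (text, raw, utf8, ascii_same) are only illustrative and do not come from the question.

```python
text = '\xc0'   # str literal: one code point, U+00C0
raw = b'\xc0'   # bytes literal: one byte, 0xC0

utf8 = text.encode('utf-8')  # encode the code point as UTF-8

print(len(text), hex(ord(text)))  # one character, code point 0xc0
print(utf8)                       # two bytes: b'\xc3\x80'
print(raw == utf8)                # False: one byte vs. two bytes

# Code points up to U+007F encode to a single identical byte,
# which is why ASCII-only strings do not show this difference.
ascii_same = 'A'.encode('utf-8') == b'A'
print(ascii_same)
```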

If you want equivalent byte strings, use the latin1 codec. The Latin-1 (a.k.a. ISO-8859-1) character set occupies the first 256 code points of the Unicode standard, so it maps code points below U+0100 1:1 to bytes. Example:

byte_1 = b'\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8'
byte_2 = bytes('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8', 'latin1')
byte_3 = ('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8').encode('latin1')

print(byte_1 == byte_2)
print(byte_1 == byte_3)
print(byte_2 == byte_3)

True
True
True
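The 1:1 mapping claimed for latin1 can be verified over every possible byte value. This is a small sketch with illustrative names (all_bytes, as_text, round_trip), not part of the original answer.

```python
# Every byte value from 0x00 to 0xFF.
all_bytes = bytes(range(256))

# Decoding with latin1 never fails: byte N becomes code point U+00NN.
as_text = all_bytes.decode('latin1')
code_points = [ord(ch) for ch in as_text]
print(code_points == list(range(256)))

# Encoding back with latin1 restores the original bytes exactly.
round_trip = as_text.encode('latin1')
print(round_trip == all_bytes)
```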

1 Comment

3123 Over a year ago

Thank you. This was exactly what I was looking for.

Jyotirmay · Accepted Answer · 2021-02-12 14:42:39Z

0

This happens because, as pointed out by @mark, byte_1 is already a byte string. When you call bytes() for byte_2, you explicitly specify the encoding as utf-8. If you had chosen a different encoding, such as ISO-8859-1, for byte_2, then byte_2 and byte_3 would not be the same. Since the encode() method defaults to utf-8, byte_2 == byte_3 is True.
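A short sketch of that point, using a shortened sample string s and illustrative names (explicit_utf8, default_encode, explicit_latin1) that are assumptions for this example:

```python
s = '\xc0*\xae'  # shortened sample; any non-ASCII text behaves the same way

explicit_utf8 = bytes(s, 'utf-8')         # encoding named explicitly
default_encode = s.encode()               # str.encode() defaults to 'utf-8'
explicit_latin1 = bytes(s, 'ISO-8859-1')  # a different encoding

print(explicit_utf8 == default_encode)    # True: same codec in both calls
print(explicit_latin1 == default_encode)  # False: different codecs
print(explicit_utf8)                      # b'\xc3\x80*\xc2\xae'
print(explicit_latin1)                    # b'\xc0*\xae'
```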


4 Comments

3123 Over a year ago

Thank you. What should be the string my_string so that I could convert that to be equal to byte_1?

Jyotirmay Over a year ago

@3123 you can try using the chardet module to detect the encoding of byte_1

3123 Over a year ago

Thanks. Using chardet, I found that the encoding of byte_1 is windows-1252. However, when I do my_bytes.decode('windows-1252'), I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7: character maps to <undefined>

Jyotirmay Over a year ago

@3123 that is because the byte 0x81 has no character assigned to it in the windows-1252 encoding.
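The failure reported above can be reproduced, and the latin1 codec from the other answer gives one possible my_string that converts back to byte_1. This is a hedged sketch: the names cp1252_ok and bad_byte are illustrative, and the resulting string is only a byte-for-byte stand-in, not necessarily meaningful text, since byte_1 may not be text at all.

```python
byte_1 = b'\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8'

# windows-1252 leaves a few byte values, including 0x81, unassigned,
# so decoding stops at the first such byte.
try:
    byte_1.decode('windows-1252')
    cp1252_ok = True
except UnicodeDecodeError as exc:
    cp1252_ok = False
    bad_byte = byte_1[exc.start]    # the offending byte value
    print(exc.start, hex(bad_byte))

# latin1 assigns a code point to every byte, so it always decodes
# and the result encodes back to the original bytes.
my_string = byte_1.decode('latin1')
print(my_string.encode('latin1') == byte_1)
```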
