byte_1 = b'\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8'
byte_2 = bytes('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8', 'utf-8')
byte_3 = ('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8').encode()

Here byte_1 == byte_2 is False, but byte_2 == byte_3 is True. Can someone help me understand why that is the case?

  • byte_1 is already a byte string. byte_2 and byte_3 are Unicode strings being encoded to byte strings. Commented Feb 12, 2021 at 14:32
  • @MarkTolonen Thank you. How do I convert byte_2 or byte_3 to byte_1? Commented Feb 12, 2021 at 15:45

2 Answers


byte_1 is already a byte string. byte_2 and byte_3 are Unicode strings being encoded to byte strings.

'\xc0' is an escape code representing a Unicode code point, U+00C0. b'\xc0' is an escape code representing a byte value of 0xC0.

In UTF-8, Unicode code points above U+007F are encoded in two or more bytes, so '\xc0'.encode() returns the two bytes b'\xc3\x80'.
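As a minimal sketch of the difference described above, the snippet below compares the same escape in a str literal and in a bytes literal. The variable names are illustrative only and are not part of the original question.

```python
# The same escape means different things in str and bytes literals.
text = '\xc0'   # a str holding one Unicode code point, U+00C0
raw = b'\xc0'   # a bytes object holding one byte, 0xC0

# UTF-8 encodes code points above U+007F as two or more bytes.
utf8_encoded = text.encode('utf-8')

print(utf8_encoded)                  # b'\xc3\x80'
print(len(raw), len(utf8_encoded))   # 1 2
print(raw == utf8_encoded)           # False
```

This is why byte_1 and byte_2 differ in the question: every non-ASCII character in the str literal grows to two bytes when encoded as UTF-8.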

If you want the encoded result to equal byte_1, use the latin1 codec. The Latin-1 (a.k.a. ISO-8859-1) character set occupies the first 256 code points of the Unicode standard, and therefore maps 1:1 from code points < U+0100 to bytes. Example:

byte_1 = b'\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8'
byte_2 = bytes('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8', 'latin1')
byte_3 = ('\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8').encode('latin1')

print(byte_1 == byte_2)  # True
print(byte_1 == byte_3)  # True
print(byte_2 == byte_3)  # True



1 Comment

Thank you. This was exactly what I was looking for.

This happens because, as pointed out by @mark, byte_1 is already a byte string. When you call bytes() for byte_2, you explicitly specify the encoding as UTF-8. Since the encode() method also defaults to UTF-8, byte_2 == byte_3 evaluates to True. If you had chosen a different encoding, such as ISO-8859-1, for byte_2, then byte_2 and byte_3 would no longer be equal.
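The point about default encodings can be checked with a short sketch. The string s below is a shortened sample chosen for illustration, not the full value from the question.

```python
# Shortened sample string (illustrative, not the full value from the question).
s = '\xc0*\xael'

default_encoded = s.encode()             # str.encode() defaults to UTF-8
explicit_utf8 = bytes(s, 'utf-8')        # same encoding, stated explicitly
latin1_encoded = bytes(s, 'iso-8859-1')  # one byte per code point below U+0100

print(default_encoded == explicit_utf8)    # True
print(latin1_encoded == default_encoded)   # False
print(latin1_encoded == b'\xc0*\xael')     # True
```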

4 Comments

Thank you. What should be the string my_string so that I could convert that to be equal to byte_1?
@3123 you can try using the chardet module to detect the encoding of byte_1
Thanks. Using chardet I found that the encoding of byte_1 is windows-1252. However, when I do my_bytes.decode('windows-1252'), I get UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7: character maps to <undefined>
@3123 That is because the byte 0x81 has no character assigned to it in the Windows-1252 encoding.
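A small sketch of the exchange above, assuming the goal is simply a str that round-trips to byte_1. It reproduces the Windows-1252 failure and then decodes with latin1 instead, which assigns a code point to every byte value. The names my_string and error_position are illustrative.

```python
byte_1 = b'\xc0*\xael<\x9e\xcf\x81\xa2\xd8\xb5\xe3\x1d\xe1\xaa8'

# Windows-1252 leaves byte 0x81 unassigned, so decoding fails there.
error_position = None
try:
    byte_1.decode('windows-1252')
except UnicodeDecodeError as exc:
    error_position = exc.start
print(error_position)   # 7

# latin1 maps every byte 0x00-0xFF to the code point with the same value,
# so decoding always succeeds and encoding restores the original bytes.
my_string = byte_1.decode('latin1')
print(my_string.encode('latin1') == byte_1)   # True
```

Whether latin1 is the encoding the data was originally written in is a separate question. The round trip only guarantees that the bytes are preserved.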
