I'm using Python 3, and I don't understand what is happening here:

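x = [0xc2, 0x50]
print('----list2bytes------')
for i in bytes(x):  # bytes(list) uses each integer as one raw byte
  print(i)

s = ''
for i in x:
  s += chr(i)  # chr() turns each integer into a Unicode character

print('----string2bytes----')
for i in s.encode():  # encode() defaults to UTF-8
  print(i)

print('----string2ord------')
for i in s:
  print(ord(i))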

----list2bytes------
194
80
----string2bytes----
195
130
80
----string2ord------
194
80

Why did the bytes change after calling s.encode()?

  • It seems you think that encode is just something like .toByteArray(). Please read a bit about it: programiz.com/python-programming/methods/string/encode. Commented Dec 2, 2018 at 8:19
  • Side note: It's a really bad idea to use str as a variable name since it's also a type. Commented Dec 2, 2018 at 9:00
  • Yes, now I know that Python 3's built-in string encoding on Windows is Unicode; when Unicode is converted to UTF-8, 194 (Unicode/ASCII) becomes C3 82 (UTF-8). And one should never convert a binary file to a string. Commented Dec 2, 2018 at 9:04
  • @chengxuncc: you may want to change this comment into an answer and mark the question as answered. Commented Dec 2, 2018 at 9:16

1 Answer

Two different concepts are involved here, Unicode Code Points and encoded bytes:

  • The chr() function will give you the character at the specified Unicode Code Point. You can look up Code Point 194 (U+00C2); it's LATIN CAPITAL LETTER A WITH CIRCUMFLEX (no surprises there).
  • Adding a character to a string adds that character, not a raw byte. Getting bytes back involves an encoding.
  • When you call .encode() on the string, you get the bytes of its UTF-8 encoding back. This is not just a concatenation of Code Points.
  • The UTF-8 encoding for the character Â has two bytes because its Code Point value is 128 or greater (and below 2048). The first byte is 192 + (Code Point div 64) == 192 + (194 div 64) == 192 + 3, which is 195 == 0xc3. The second byte is 128 + (Code Point mod 64) == 128 + (194 % 64) == 128 + 2, which is 130 == 0x82.

    Hence the character Â encodes to 0xc3, 0x82 in UTF-8.

    The second character's (P) Code Point value is below 128, so it is encoded as a single byte with the same value. Therefore 0xc3, 0x82, 0x50 == 195, 130, 80 is the entire string encoded to UTF-8.

    Note that the Code Point sequence 194, 80 does not reappear inside the UTF-8 bytes 195, 130, 80: the encoder did not simply insert an extra byte, it replaced 194 with the two bytes 195, 130.

  • Calling ord() will give you Unicode Code Points for each character again. The integer representation of the Unicode Code Point for character LATIN CAPITAL LETTER A WITH CIRCUMFLEX is 194.
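
The two-byte arithmetic above can be checked with a short sketch. The helper name utf8_two_byte is made up for illustration and only handles Code Points in the two-byte range (128 to 2047); Python's own str.encode() is what you would use in practice. The last line assumes you actually want the original byte values back, in which case the latin-1 codec maps Code Points 0 to 255 onto the same byte values.

```python
def utf8_two_byte(code_point):
    """Illustrative helper: encode a Code Point in 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only Code Points 128..2047 use the two-byte form")
    first = 0xC0 + (code_point // 64)   # 192 + the high bits
    second = 0x80 + (code_point % 64)   # 128 + the low six bits
    return bytes([first, second])


x = [0xC2, 0x50]
s = ''.join(chr(i) for i in x)

# 0x50 is below 128, so it is encoded as a single byte with the same value.
manual = utf8_two_byte(0xC2) + bytes([0x50])

print(list(manual))               # [195, 130, 80]
print(list(s.encode('utf-8')))    # [195, 130, 80]
print(list(s.encode('latin-1')))  # [194, 80]
```

The manual result matches s.encode('utf-8'), which is the 195, 130, 80 seen in the question's output.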