I need to store a Python str in a database, retrieve it, and then apply the format() and encode() methods to it in order to shape my request frame, convert it into bytes, and finally send() it through a socket.

MWE is like this:

fstr = '{slaveid:}{command:s}\x0d'
cstr = fstr.format(slaveid=chr(128+43), command='flags')
bstr = cstr.encode()

And produces the following output:

{slaveid:}{command:s}
«flags
b'\xc2\xabflags\r'

My problem occurs at the third line: characters greater than 127 become two bytes when encode() is applied. I suppose it is all about charset definition, and because the default encoding 'ascii' is limited to 127.

How should I define my encoding in order to get the following conversion:

b'\xabflags\r'

I am a little lost when it comes to charset tables.

  • Are you trying to create a bytearray ? You can try: bytearray(cstr) Commented Dec 24, 2015 at 11:21
  • you can use cstr.encode('latin-1') Commented Dec 24, 2015 at 11:45
  • "char greater than 127 become two bytes when performing encode() method. Is suppose it is all about charset definition" - That's because encode() is encoding the string as UTF-8 and \xc2\xab is the UTF-8 encoding for \u00ab (the « character). You could try cstr.encode(encoding='iso-8859-1') instead. Commented Dec 24, 2015 at 11:48
  • @Saksow: I know about bytearray, but I think it is not necessary here. I prefer to control the encoding when calling encode(). Thank you for answering. Commented Dec 24, 2015 at 12:02
  • @MukundMK, Gord Thompson: I found that 'cp1252' also works. 'latin-1' and 'iso-8859-1' work too. Would one of you mind writing an answer that I can accept, stating which encoding is the best choice among those possibilities and whether the characters between 128 and 255 are the same in each charset? Thank you anyway. Commented Dec 24, 2015 at 12:06

1 Answer

As mentioned in the comments to the question, the issue is a result of the .encode() method encoding the string to UTF-8 by default. The character inserted by chr(128+43) is \u00ab which is encoded to two bytes in UTF-8: \xc2\xab.
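The two-byte result can be reproduced in isolation. The following is a minimal sketch, assuming Python 3, where str.encode() with no argument uses UTF-8; the variable names are only illustrative.

```python
# chr(128 + 43) is U+00AB, the « character.
ch = chr(128 + 43)

# With no argument, str.encode() uses UTF-8 in Python 3.
utf8_bytes = ch.encode()
explicit_utf8 = ch.encode('utf-8')

print(hex(ord(ch)))      # 0xab
print(utf8_bytes)        # b'\xc2\xab'
print(len(utf8_bytes))   # 2
```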

The solution is to specify a single-byte character encoding when calling .encode(). Any of the following will work ...

cstr.encode(encoding='latin_1')
cstr.encode(encoding='iso-8859-1')
cstr.encode(encoding='cp1252')

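... although it should be noted that while iso-8859-1 is just an alias for latin_1, cp1252 and latin_1 are not the same thing. They agree for the byte values 0xA0 to 0xFF, so all three give b'\xab' for chr(128+43). They differ in the range 0x80 to 0x9F: latin_1 maps every code point in range(256) to the byte of the same value, whereas cp1252 assigns those bytes to other characters (0x80 is €, for example), so chr(128).encode('cp1252') raises UnicodeEncodeError. Since the actual "character" is not important in your case, just its (single) byte value in range(256), latin_1 is the safer choice.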
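To check how these codecs treat the 128 to 255 range, here is a small sketch. The build_frame helper and its parameter names are hypothetical and only wrap the format string from the question. The sketch collects the code points that cp1252 refuses to encode rather than assuming them.

```python
def build_frame(slaveid: int, command: str) -> bytes:
    """Hypothetical helper: format the request, then encode one byte per character."""
    fstr = '{slaveid:}{command:s}\x0d'
    cstr = fstr.format(slaveid=chr(slaveid), command=command)
    return cstr.encode('latin_1')

frame = build_frame(128 + 43, 'flags')
print(frame)  # b'\xabflags\r'

# latin_1 maps every code point 0..255 to the byte of the same value.
latin1_is_identity = all(chr(n).encode('latin_1') == bytes([n]) for n in range(256))

# Collect the code points in 128..255 that cp1252 cannot encode.
cp1252_failures = []
for n in range(128, 256):
    try:
        chr(n).encode('cp1252')
    except UnicodeEncodeError:
        cp1252_failures.append(n)

print(latin1_is_identity)
print([hex(n) for n in cp1252_failures])
```

If the slave id can fall anywhere in 128 to 255, the list printed last shows which values would break with cp1252, which is why the sketch encodes with latin_1.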
