Python encoding format

Question

I need to store a Python str in a database, retrieve it, and then apply the format() and encode() methods to it in order to shape my request frame, convert it into bytes, and finally send() it through a socket.

A MWE looks like this:

fstr = '{slaveid:}{command:s}\x0d'
cstr = fstr.format(slaveid=chr(128+43), command='flags')
bstr = cstr.encode()

It produces the following output:

{slaveid:}{command:s}
«flags
b'\xc2\xabflags\r'

My problem occurs at the third line: characters greater than 127 become two bytes when encode() is called. I suppose it is all about the charset definition, and because the default encoding 'ascii' is limited to 127.

How should I define my encoding in order to get the following conversion:

b'\xabflags\r'

I am a little lost among the charset tables.

Are you trying to create a bytearray? You can try: bytearray(cstr)
– Seif, Commented Dec 24, 2015 at 11:21
"char greater than 127 become two bytes when performing encode() method. I suppose it is all about charset definition" - That's because encode() is encoding the string as UTF-8, and \xc2\xab is the UTF-8 encoding for \u00ab (the « character). You could try cstr.encode(encoding='iso-8859-1') instead.
– Gord Thompson, Commented Dec 24, 2015 at 11:48
@Saksow: I know about bytearray, but I don't think it is necessary here. I prefer to control the encoding when calling encode(). Thank you for answering.
– jlandercy, Commented Dec 24, 2015 at 12:02
@MukundMK, Gord Thompson: I found that 'cp1252' also works, as do 'latin-1' and 'iso-8859-1'. Would one of you mind writing an answer that I can accept, stating which encoding is the best choice among those possibilities and whether the characters between 128 and 255 are the same in each charset? Thank you anyway.
– jlandercy, Commented Dec 24, 2015 at 12:06

Gord Thompson · Accepted Answer · 2016-01-06 18:56:50Z

2

As mentioned in the comments to the question, the issue is a result of the .encode() method encoding the string to UTF-8 by default. The character inserted by chr(128+43) is \u00ab which is encoded to two bytes in UTF-8: \xc2\xab.
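
A quick check of this, as a minimal sketch assuming Python 3 (where str.encode() uses UTF-8 when no encoding is given); the variable names are only illustrative:

```python
ch = chr(128 + 43)            # code point U+00AB, the « character

default_bytes = ch.encode()   # no argument: Python 3 encodes as UTF-8
utf8_bytes = ch.encode('utf-8')

print(hex(ord(ch)))           # the code point itself fits in one byte
print(default_bytes)          # but its UTF-8 form takes two bytes
print(default_bytes == utf8_bytes)
```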

The solution is to specify a single-byte character encoding when calling .encode(). Any of the following will work ...

cstr.encode(encoding='latin_1')
cstr.encode(encoding='iso-8859-1')
cstr.encode(encoding='cp1252')

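... although it should be noted that while iso-8859-1 is just an alias for latin_1, cp1252 and latin_1 are not the same thing. For chr(128+43) it doesn't matter, because the actual "character" is not important, just its (single) byte value, and all three encodings map \u00ab to \xab. In general, latin_1 is the safer choice: it maps every code point in range(256) to the byte with the same value, whereas cp1252 assigns different characters to bytes \x80 through \x9f, so chr(n) for n in range(128, 160) does not encode to byte n with cp1252.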
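
The comparison between these codecs can be checked with a short sketch like the one below. It reuses the fstr/cstr names from the question; the helper cp1252_mismatches is hypothetical and exists only for this illustration.

```python
fstr = '{slaveid:}{command:s}\x0d'
cstr = fstr.format(slaveid=chr(128 + 43), command='flags')

# For this particular frame, the three codecs are expected to agree.
for name in ('latin_1', 'iso-8859-1', 'cp1252'):
    print(name, cstr.encode(encoding=name))

# latin_1: check that every code point 0..255 becomes the byte of the same value.
latin1_ok = all(chr(i).encode('latin_1') == bytes([i]) for i in range(256))

def cp1252_mismatches():
    """Code points in range(256) that cp1252 rejects or maps to another byte."""
    bad = []
    for i in range(256):
        try:
            if chr(i).encode('cp1252') != bytes([i]):
                bad.append(i)
        except UnicodeEncodeError:
            bad.append(i)
    return bad

mismatches = cp1252_mismatches()
print(latin1_ok)
print([hex(i) for i in mismatches])
```

If the slave id can ever fall in range(128, 160), this suggests sticking with latin_1 (or its alias iso-8859-1) rather than cp1252.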

