Python 2 socket and string coding

Question

I am reading a file in utf-8 into unicode and I do not get any errors.

import codecs

f = codecs.open(fil_name, "r", "utf-8")
f_str = f.read()

That is, the string f_str is unicode. Later in the program I have to send the unicode string in f_str to a socket, so I am trying to convert it back to UTF-8.

usock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
usock.connect(("xxx server", 123))
usock.send("TEXT %s\nENDQ\n" % f_str.replace("\n", " ").encode("utf-8"))

Here I get an error message:

usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)

My text contains characters that cannot be encoded in pure ASCII (äö..), but that is not a problem with utf-8 or latin-1. Why am I getting this error? I am not using ASCII, I am using unicode/utf-8.

Keith · Accepted Answer · 2012-03-17 11:57:23Z

1

Your string literal is a byte string. When you interpolate into it, Python implicitly converts between byte strings and unicode using the default encoding (ascii), and that conversion fails on non-ASCII data.

There are a couple of ways to fix this. One is to just use Python 3. ;-)

If you are using Python 2 then put the following at the top of the source file:

from __future__ import unicode_literals

Then your literal will be unicode also.

You could also prefix the string with a 'u'.

Another problem with that line is precedence. The .encode("utf-8") call binds more tightly than the % operator, so it applies only to the result of replace(). The '%s' formatting then runs after the right side is complete, and that is where the implicit conversion with the ascii codec happens.

So, try this:

(u"TEXT %s\nENDQ\n" % f_str.replace(u"\n", u" ")).encode("utf-8")
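A minimal sketch of the precedence point, written so that it runs on both Python 2 and Python 3. The helper name build_message is hypothetical and not part of the question's code. It formats the whole message as unicode text first and encodes exactly once at the end.

```python
def build_message(text):
    """Format the whole message as unicode text, then encode it once."""
    # The parentheses make the % formatting run before .encode();
    # without them, .encode() would apply only to the replace() result.
    return (u"TEXT %s\nENDQ\n" % text.replace(u"\n", u" ")).encode("utf-8")


# u"\xe4\xf6" is the text "äö" written with escapes, so the source stays ASCII.
payload = build_message(u"\xe4\xf6\nsecond line")
```

The resulting payload is a byte string, which is what usock.send() expects.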


1 Comment

andreSmol Over a year ago

Thanks Keith, the problem was the precedence of operations, as you said. The parentheses solved it.

alonisser · Accepted Answer · 2012-03-17 11:16:06Z

0

Begin by going through the obvious Python unicode checklist:

putting # -*- coding: utf-8 -*- at the top of every source file
checking that the text file encoding really is utf-8 (the default on many systems is something else, such as a Windows code page like cp1255)
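One way to run the second check from code is to read the raw bytes and try to decode them. This is only a sketch, and the function name is_utf8_file is made up for illustration. A failed decode proves the file is not UTF-8; a successful one is only a strong hint, because pure ASCII is also valid UTF-8.

```python
import io


def is_utf8_file(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    # Read raw bytes so that no implicit decoding happens on the way in.
    with io.open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```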

Also:

Why do you need to encode('utf-8') if it is already unicode? What error message do you get if you don't do that?

And did you try to explicitly convert f_str to unicode, like this:

f_str=unicode(f_str)

Also try printing f_str and check that you are getting the right result first. Maybe this is a problem with the data.


2 Comments

andreSmol Over a year ago

I am fulfilling item 1 that you stated, and the file is definitely in utf-8. I am able to read the data using codecs.open(..."utf-8"), which ensures that the data is converted to unicode. I tried printing f_str and it is printed correctly.

andreSmol Over a year ago

If I remove the encode("utf-8"), I get a similar error message: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 46: ordinal not in range(128). It seems that Python tries to convert to ascii and finds chars that cannot be converted. My text contains chars that are not ASCII.

unutbu · Accepted Answer · 2012-03-17 12:41:19Z

0

The error occurs on this line

usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))

I can reproduce a similar error this way:

In [23]: text = 'äö'

In [24]: 'TEXT %s' % text.replace("\n", " ").encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Although you've shown that f_str is unicode, somehow, text is a str object. Some extra processing you are doing between f_str and text is probably making text a str.

If you can convert all input to unicode, work with them as unicode, and only convert back to a specific encoding upon output (as needed), your problem should be fixed.
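The decode-early, encode-late approach could be sketched roughly as follows, in a form that runs on both Python 2 and Python 3. The function names read_text, make_request and send_request are hypothetical; sock stands for any connected socket, such as usock in the question.

```python
import io


def read_text(path):
    """Input boundary: decode UTF-8 bytes from disk into unicode text."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def make_request(text):
    """Middle: work only with unicode text, with no byte strings mixed in."""
    return u"TEXT %s\nENDQ\n" % text.replace(u"\n", u" ")


def send_request(sock, text):
    """Output boundary: encode once, right before the bytes leave."""
    sock.sendall(make_request(text).encode("utf-8"))
```

With these helpers, the question's code would reduce to send_request(usock, read_text(fil_name)). Because every value between the two boundaries is unicode, Python never has to fall back on the ascii codec.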
