I am reading a file in utf-8 into unicode and I do not get any errors.

import codecs

# Decode the file as UTF-8 while reading, so f_str is a unicode object.
with codecs.open(fil_name, "r", "utf-8") as f:
    f_str = f.read()

That is, the string f_str is a unicode object. Later in the program I have to send the unicode string in f_str to a socket, so I am trying to convert it back to UTF-8.

usock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
usock.connect(("xxx server", 123))
usock.send("TEXT %s\nENDQ\n" % f_str.replace("\n", " ").encode("utf-8"))

Here I am getting an error message:

usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)

My text contains characters that cannot be encoded in pure ASCII (ä, ö, ...), but that is not a problem with UTF-8 or Latin-1. Why am I getting this error? I am not using ASCII; I am using unicode/UTF-8.

3 Answers

Your string literal is a byte string. When you try to interpolate into it, Python will implicitly try to convert to a byte string using the default encoding (ascii).

There are a couple of ways to fix this. One is to just use Python 3. ;-)

If you are using Python 2 then put the following at the top of the source file:

from __future__ import unicode_literals

Then your literal will be unicode also.

You could also prefix the string with a 'u'.

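Another problem with that line is precedence. The method call binds more tightly than %, so .encode("utf-8") is applied only to the right-hand operand. The '%s' format operation runs after the right side is complete, and that is where Python attempts the implicit conversion with the ascii codec.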

So, try this:

(u"TEXT %s\nENDQ\n" % f_str.replace(u"\n", u" ")).encode("utf-8")
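
To make the order of operations explicit, here is a minimal sketch of the same fix wrapped in a helper. The name build_message is hypothetical and not part of the original code; the idea is simply to do all formatting on unicode text and encode the finished message exactly once.

```python
def build_message(text):
    """Format the request as unicode text first, then encode the whole thing once."""
    # Assumes `text` is already a unicode string (for example, read via codecs.open).
    line = text.replace(u"\n", u" ")
    # The parentheses make .encode() apply to the complete formatted message.
    return (u"TEXT %s\nENDQ\n" % line).encode("utf-8")
```

With a helper like this, the call in the question would become something like usock.sendall(build_message(f_str)); sendall keeps sending until the whole byte string has been written, whereas send may write only part of it.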

1 Comment

Thanks Keith, the problem was the precedence of operations, as you said. The parentheses solved it.

Begin by going through the obvious Python unicode checklist:

  1. Put # -*- coding: utf-8 -*- at the top of every source file.
  2. Check that the text file really is encoded in UTF-8 (many editors default to a legacy code page such as Windows-1255 instead).

Also:

Why do you need encode('utf-8') if the string is already unicode? What error message do you get if you leave it out?

And did you try explicitly converting f_str to unicode, like this:

f_str = unicode(f_str)

Also try printing f_str and checking that you get the right result before sending it. Maybe this is a problem with the data.
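
Printing alone can hide the difference between a byte string and a unicode string, so it may help to look at the type and the escaped repr as well. This is only a small diagnostic sketch; the helper name describe is made up for illustration.

```python
def describe(value):
    """Return the type name plus an escaped repr, so raw bytes stay visible."""
    # On Python 2 the type names are 'str' (bytes) and 'unicode' (text);
    # on Python 3 they are 'bytes' and 'str'.
    return "%s: %r" % (type(value).__name__, value)
```

Calling print(describe(f_str)) just before the send shows whether the value is still text or has already become a byte string somewhere along the way.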

2 Comments

I am fulfilling item 1 as you stated, and the file is definitely in UTF-8. I am able to read the data using codecs.open(..."utf-8"), which ensures that the data is converted to unicode. I tried printing f_str and it is printed correctly.
If I remove the encode("utf-8"), I get a similar error message: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 46: ordinal not in range(128). It seems that Python tries to convert to ASCII and finds characters that cannot be converted. My text contains characters that are not ASCII.

The error occurs on this line

usock.send("TEXT %s\nENDQ\n" % text.replace("\n", " ").encode("utf-8"))

I can reproduce a similar error this way:

In [23]: text = 'äö'

In [24]: 'TEXT %s' % text.replace("\n", " ").encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Although you've shown that f_str is unicode, somehow, text is a str object. Some extra processing you are doing between f_str and text is probably making text a str.

If you can convert all input to unicode, work with them as unicode, and only convert back to a specific encoding upon output (as needed), your problem should be fixed.
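
As a rough sketch of that "decode on input, encode on output" pattern: the helper name to_text and the sample bytes below are illustrative assumptions, not code from the question.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a unicode string, decoding byte strings explicitly."""
    if isinstance(value, bytes):  # on Python 2, bytes is an alias of str
        return value.decode(encoding)
    return value

raw = b"\xc3\xa4\xc3\xb6"                      # UTF-8 bytes for a-umlaut, o-umlaut
text = to_text(raw)                            # decode once, on input
payload = (u"TEXT %s" % text).encode("utf-8")  # encode once, on output
```

Passing every incoming value through one explicit decode step means the ascii codec is never invoked implicitly, which is what produced the UnicodeDecodeError above.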
