
I am trying to write a "string" to a file and get the following error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xcd' in position 6: ordinal not in range(128)

I tried the following methods:

print >>f, txt
print >>f, txt.decode('utf-8')
print >>f, txt.encode('utf-8')

None of them work. I get the same error message each time.

What is the idea behind encoding and decoding? If I have a unicode object, can I write it to the file directly, or do I need to transform it into a string first?

How can I find out which encoding is used? How can I tell whether it is utf-8, ascii, or something else?

ADDED

I think I have just managed to save a string into a file. print >>f, txt as well as print >>f, txt.decode('utf-8') did not work but print >>f, txt.encode('utf-8') works. I get no error message and I see Chinese characters in my file.

  • And what's that string? Commented Apr 25, 2016 at 8:04
  • @EbraHim, I guess that it is a unicode object because I obtained the strings by reading them in the following way: for line in io.open(fname, encoding="utf8"): Commented Apr 25, 2016 at 8:06
  • @Roman for line in io.open(fname, encoding="utf8"): change the encoding to utf-8 Commented Apr 25, 2016 at 8:08
  • your question is answered here: stackoverflow.com/questions/6048085/… Commented Apr 25, 2016 at 8:11
  • Files contain bytes. Unicode strings are made up of code points. You need to translate those into bytes, there are many ways to do that, that is called encoding. Commented Apr 25, 2016 at 8:11

2 Answers

I recently posted another answer that addresses this very issue. Key quote:

For a good overview of the difference, read one of Joel's articles, but the gist is that bytes are, well, bytes (groups of 8 bits without any further meaning attached), whereas characters are the things that make up strings of text. Encoding turns characters into bytes, and decoding turns bytes back into characters.

In Python 2, unicode objects are character strings. Regular str objects can be either character strings or byte strings. (Pro tip: use Python 3, it makes keeping track a lot easier.)
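The round trip described above can be sketched as follows. This is only an illustration: the sample character is the one from the error message in the question, and the variable names are made up. The u'' literal is accepted by Python 2 and by Python 3.3+.

```python
# One character: U+00CD, the character named in the error message.
text = u'\xcd'

# Encoding: characters -> bytes.
data = text.encode('utf-8')

# Decoding: bytes -> characters.
restored = data.decode('utf-8')

print(repr(data))        # the two bytes that represent the character in UTF-8
print(restored == text)  # the round trip gives back the original string
```

The same bytes decoded with a different codec would give different characters, which is why the codec has to be known on both sides.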

You should be passing character strings (not byte strings) to print, but you will need to be sure that those character strings can be encoded by the codec (such as ASCII or UTF-8) associated with the destination file object f. As part of the output process, Python encodes the string for you. If the string contains characters that cannot be encoded by the file object's codec, you will get errors like the one you're seeing.
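A minimal sketch of giving the file object an explicit codec, assuming io.open (available in Python 2.6+ and Python 3). The output path and the sample text are hypothetical and only serve the illustration.

```python
import io
import os
import tempfile

# Hypothetical output path, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')

# A character string containing non-ASCII characters.
txt = u'caf\xe9 \xcd'

# The file object is opened with an explicit codec, so it encodes
# character strings to UTF-8 bytes on write.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(txt)

# Reading the file in binary mode shows the bytes that were stored.
with io.open(path, 'rb') as f:
    raw = f.read()

print(repr(raw))
```

With this approach the string is passed as characters, and the encoding step happens inside the file object rather than through the default ASCII codec.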

Without knowing what is in your txt object I can't be more specific.


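I think you need to use the codecs library:

import codecs

# codecs.open returns a file object that encodes unicode text to UTF-8 on write.
# The with block closes the file even if the write fails.
with codecs.open("test.txt", "w", "utf-8") as f:
    f.write(u'\xcd')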

Works fine.

The Story of Encoding/Decoding:

In the past, computers supported only a small set of characters: upper-case and lower-case letters, digits, and some special characters. ASCII defines 128 of them, so 7 bits (stored in a single byte) are enough to assign a unique number to each character. Assigning numbers to characters so they can be stored as bytes is called encoding. This one-byte encoding, which Python 2 uses by default, is named ASCII.

As computers spread around the world, many more letters and characters were needed, and one byte was no longer enough. Unicode assigns a number (a code point) to each of these characters, and encodings such as UTF-8 and UTF-16 define how those numbers are stored as bytes. The character that you are trying to store in your file is outside the ASCII range and needs 2 bytes in UTF-8. So you must explicitly tell Python that you don't want to use the default encoding, i.e. ASCII.
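To make the byte counts concrete, here is a small sketch. The variable names are made up for the example, and the Chinese character is an arbitrary sample, not taken from the question's data.

```python
ch = u'\xcd'  # the character from the error message

# ASCII only covers code points 0..127, so this character cannot be encoded.
try:
    ch.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# The number of bytes depends on the encoding that is chosen.
utf8_len = len(ch.encode('utf-8'))
latin1_len = len(ch.encode('latin-1'))

# An arbitrary Chinese character needs more bytes in UTF-8.
cjk_len = len(u'\u4e2d'.encode('utf-8'))

print(ascii_ok, utf8_len, latin1_len, cjk_len)
```

The same character can take a different number of bytes in different encodings, so the size is a property of the encoding and not of the character itself.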
