Python - character encoding and decoding problems

Question

I have one source file with UTF-8 characters (names).
I have one output file with the same character encoding.
I am working with an HTML page and copy the information that is useful to me into the output file.
I use "éáűúőóüöäđĐ" characters in my "friendsNames" txt file.

And I got this error:

Traceback (most recent call last):
  File "C:\Users\Rendszergazda\workspace\achievements\hiba.py", line 9, in <module>
    s = str(urlopen("http://eu.battle.net/wow/en/character/arathor/"+str(names[0])+"/achievement").read(), encoding='utf-8')
  File "C:\Python27\lib\encodings\cp1250.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

What do you think? What is my problem?

from urllib import urlopen
import codecs

result = codecs.open("C:\Users\Desktop\Achievements\Result.txt", "a", "utf-8")
fh = codecs.open("C:\Users\Desktop\Achievements\FriendsNames.txt", "r", "utf-8")
line = fh.readline()
names = line.split(" ")
fh.close()

s = urlopen("http://eu.battle.net/wow/en/character/arathor/"+str(names[0])+"/achievement").read(), encoding='utf8')
result.write(str(s))
result.close()

Just for information: the character 0xfeff is a BOM. Additionally, your error message and your code sample do not match.
– hochl, Commented Mar 26, 2012 at 11:45
If you want to learn more about unicode, I strongly recommend bit.ly/unipain
– Thomas Wouters, Commented Mar 26, 2012 at 11:46
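
As a rough illustration of the BOM point in the first comment: if FriendsNames.txt was saved by an editor that writes a UTF-8 BOM, decoding it with the standard "utf-8-sig" codec instead of "utf-8" drops the leading u'\ufeff'. The helper names `read_names` and `write_sample_with_bom` are made up for this sketch, and the sample file is a temporary stand-in for the asker's file.

```python
import codecs
import os
import tempfile


def read_names(path):
    """Hypothetical helper: return the names found on the first line of the file."""
    # "utf-8-sig" decodes UTF-8 and silently skips a leading BOM if one is present.
    with codecs.open(path, "r", "utf-8-sig") as fh:
        # split() with no argument also discards the trailing newline.
        return fh.readline().split()


def write_sample_with_bom(names):
    """Hypothetical helper for the demo: write a BOM-prefixed names file."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.write(fd, codecs.BOM_UTF8 + u" ".join(names).encode("utf-8") + b"\n")
    os.close(fd)
    return path
```

With plain "utf-8" the first name would come back with u'\ufeff' glued to its front, which is the character the traceback complains about.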

Thomas Wouters · Accepted Answer · 2012-03-26 11:44:58Z

2

The problem is that you're calling str(names[0]), where names[0] is a unicode string. This means it will be encoded with the default encoding, which in your case seems to be cp1250 for some reason. (Did you mess with sys.setdefaultencoding()? Don't do that.)

To get bytestrings out of unicode, encode the unicode explicitly rather than calling str() on it. Use the encoding the result should have (for URLs this is somewhat difficult to guess, but in this case it is probably UTF-8). So, use `names[0].encode('utf-8')`. You may also need to quote the non-ASCII characters in your URL, although that depends on what the remote end expects.
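
A minimal sketch of the explicit encode plus URL quoting described above. It assumes the server expects percent-encoded UTF-8, which is a guess, and the helper name `build_achievement_url` is invented for illustration. The import fallback only lets the same snippet run on Python 2 and Python 3.

```python
try:
    from urllib import quote          # Python 2
except ImportError:
    from urllib.parse import quote    # Python 3

BASE_URL = u"http://eu.battle.net/wow/en/character/arathor/"


def build_achievement_url(name):
    """Hypothetical helper: build the achievement URL for a unicode character name."""
    # Drop a stray BOM and surrounding whitespace left over from reading the file.
    name = name.lstrip(u"\ufeff").strip()
    # Encode explicitly instead of calling str(), then percent-encode the bytes.
    encoded = name.encode("utf-8")
    return BASE_URL + quote(encoded) + u"/achievement"
```

For example, `build_achievement_url(u"\u0151r")` yields a URL whose name part is `%C5%91r`, so the string passed to urlopen() contains only ASCII.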


3 Comments

user1292883 Over a year ago

But now I have a new problem: with utf-8, xy.write "eats" my "\n". I tried u"\u000A" (UTF-8 newline), but it does not work :(

Thomas Wouters Over a year ago

I'm afraid I don't understand the problem. u"\u000A" is the same thing as u"\n", and it's unicode, not UTF-8. (See bit.ly/unipain.) I suggest you post a new question describing your current problem.

alexis Over a year ago

You're probably on Windows and trying to open your output with Notepad, or something. Notepad only understands \r\n, but Word and Wordpad will display your file just fine.
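
If Notepad is indeed the issue, one possible workaround is to write Windows line endings yourself. codecs.open() always opens the underlying file in binary mode, so "\n" is never translated to "\r\n". The sketch below uses io.open() with newline="\r\n" instead; `append_lines` is a made-up helper name and this is only one way to do it.

```python
import io


def append_lines(path, lines):
    """Hypothetical helper: append unicode lines as UTF-8 with \\r\\n line endings."""
    # newline="\r\n" makes io translate every "\n" written into "\r\n".
    with io.open(path, "a", encoding="utf-8", newline="\r\n") as out:
        for line in lines:
            out.write(line + u"\n")
```

Writing u"\r\n" explicitly through codecs.open() should give the same bytes on disk.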
