I am parsing an HTML file that contains many special characters (both as literal Unicode characters and as HTML entities). Despite having read a lot of documentation on Unicode in Python, I still cannot convert HTML entities properly.
Here is the test I ran:
>>> import HTMLParser
>>> p = HTMLParser.HTMLParser()
>>> s = p.unescape("&#139;")
>>> repr(s)
"u'\\x8b'"
>>> print s
‹ # !!!
>>> s
u'\x8b'
>>> print s.encode("latin1")
‹ # OK, it prints fine in latin1, but I need UTF-8 ...
>>> print s.encode("utf8")
‹ # !!!
>>> import codecs
>>> out = codecs.open("out8.txt", encoding="utf8", mode="w")
>>> out.write(s)
# Viewing the file as ANSI gives me ‹ # !!!
# Viewing the file as UTF8 gives NOTHING, as if the file were empty # !!!
What is the correct way to write the unescaped string s to a UTF-8 file?
latin1 would have failed instead. The locale command confirms that it is UTF-8. (I am under Linux.)
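To make the symptoms above easier to compare, here is a small diagnostic sketch that looks at the raw bytes instead of relying on how a terminal or editor renders them. It starts from the value u'\x8b' shown in the session. The last two lines rest on an assumption that is not confirmed anywhere above: that the entity was meant as the Windows-1252 byte 0x8B (the ‹ character) rather than as the Unicode code point U+008B.

```python
# -*- coding: utf-8 -*-
import unicodedata

s = u"\x8b"  # the value unescape returned in the session above

# U+008B belongs to the C1 control block, so it has no visible glyph.
category = unicodedata.category(s)

# Latin-1 maps code points 0-255 straight to single bytes.
latin1_bytes = s.encode("latin1")

# UTF-8 needs two bytes for any code point in the range U+0080..U+07FF.
utf8_bytes = s.encode("utf8")

# ASSUMPTION (not confirmed by the session): the entity was intended as
# the Windows-1252 byte 0x8B, which is the visible character U+2039.
intended = latin1_bytes.decode("cp1252")
intended_utf8 = intended.encode("utf8")

print(category)
print(repr(latin1_bytes), repr(utf8_bytes))
print(repr(intended), repr(intended_utf8))
```

If that assumption holds, it would be consistent with both observations: a viewer that treats the file as ANSI (Windows-1252) shows the two UTF-8 bytes as two characters, while a UTF-8 viewer decodes them to an invisible control character and appears to show nothing.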