I am parsing an HTML file that contains many special characters, both as literal Unicode characters and as HTML entities. Despite having read a lot of documentation on Unicode in Python, I still cannot convert the HTML entities properly.

Here is the test I ran:

>>> import HTMLParser
>>> p = HTMLParser.HTMLParser()
>>> s = p.unescape("&#139;")
>>> repr(s)
"u'\\x8b'"
>>> print s 
‹ # !!!
>>> s
u'\x8b'
>>> print s.encode("latin1")
‹ # OK, it prints fine in latin1, but I need UTF-8 ...
>>> print s.encode("utf8")
‹ # !!!

>>> import codecs
>>> out = codecs.open("out8.txt", encoding="utf8", mode="w")
>>> out.write(s)
# Viewing the file as ANSI gives me ‹ # !!!
# Viewing the file as UTF8 gives NOTHING, as if the file were empty # !!!

What is the correct way to write the unescaped string s to a UTF-8 file?

  • Are you at a command prompt in Windows, by chance? Commented Oct 4, 2012 at 16:53
  • You would only see the correct output of the UTF-8 encoded print if your interactive session were itself running on a UTF-8 terminal. And it is not, because if it were, the print encoded as latin1 would have failed instead. Commented Oct 4, 2012 at 17:02
  • To answer the question about the encoding of my session: the output of the locale command confirms that it is UTF-8. (I am on Linux.) Commented Oct 4, 2012 at 17:19

1 Answer


U+008B is a control character, therefore seeing nothing is not unusual. "‹" is U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK, and is not even in Latin-1. It is, however, character 0x8B in CP1252. And stop relying on the Windows console output to tell you what's correct or not, unless you run chcp 65001 beforehand.
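The relationship between U+008B, CP1252 byte 0x8B and U+2039 can be checked directly. The following is only a minimal sketch, and it assumes Python 3 (the question uses Python 2's HTMLParser module). `fix_c1_controls` is a hypothetical helper name, not part of any library: it reinterprets characters in the C1 control range as CP1252 bytes, which is one possible way to repair data that was generated with the wrong numeric references.

```python
import html


def fix_c1_controls(text):
    """Reinterpret C1 controls (U+0080..U+009F) as CP1252 bytes.

    Hypothetical helper for illustration, not a library function.
    """
    fixed = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # no CP1252 mapping for this byte in Python's codec; keep it
        fixed.append(ch)
    return "".join(fixed)


bad = "\x8b"                        # the control character from the question
good = fix_c1_controls(bad)         # U+2039, the intended angle quotation mark
utf8_bytes = good.encode("utf-8")   # the bytes a UTF-8 file should contain

# Python 3's html.unescape() applies the same CP1252 remapping to &#139; itself.
py3 = html.unescape("&#139;")
```

With the repair applied, `good` is U+2039 and encodes to three UTF-8 bytes, whereas the original U+008B encodes to the two bytes C2 8B, which a UTF-8 viewer renders as an invisible control character.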

3 Comments

I use a Linux console over SSH. From your explanation, I infer that there is a bug in the unescape() function, which returns U+008B instead of U+2039. Am I wrong?
The bug is in the data. Or more specifically, whatever generated the data. It should have used &lsaquo; to encode the character, but instead selfishly assumed that the world revolves around Microsoft and used a character that doesn't exist in the proper specifications.
Nice link, I had not found the full list of HTML entities. You are right, the data is violating the specs. Running my test with &lsaquo; gives the expected result.
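For the original question of writing the unescaped string to a UTF-8 file, a minimal sketch follows. It assumes Python 3 and the named entity &lsaquo; as input. The temporary directory is an assumption made so the example is self-contained; only the file name out8.txt comes from the question.

```python
import html
import os
import tempfile

s = html.unescape("&lsaquo;")  # named entity for U+2039

# Write the text with an explicit encoding instead of relying on the locale.
path = os.path.join(tempfile.mkdtemp(), "out8.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(s)

# Read the raw bytes back to verify the file independently of any terminal.
with open(path, "rb") as f:
    raw = f.read()
```

Inspecting `raw` rather than printing to the console avoids the terminal-encoding ambiguity discussed in the comments under the question.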
