Escaping HTML entities and UTF-8 in Python

Question

I am parsing an HTML file that contains many special characters, both as literal Unicode characters and as HTML entities. Despite having read a lot of documentation on Unicode in Python, I still cannot convert the HTML entities properly.

Here is the test I ran:

>>> import HTMLParser
>>> p = HTMLParser.HTMLParser()
>>> s = p.unescape("&#139;")
>>> repr(s)
"u'\\x8b'"
>>> print s 
Â‹ # !!!
>>> s
u'\x8b'
>>> print s.encode("latin1")
‹ # OK, it prints fine in latin1, but I need UTF-8 ...
>>> print s.encode("utf8")
Â‹ # !!!

>>> import codecs
>>> out = codecs.open("out8.txt", encoding="utf8", mode="w")
>>> out.write(s)
# Viewing the file as ANSI gives me Â‹ # !!!
# Viewing the file as UTF8 gives NOTHING, as if the file were empty # !!!

What is the correct way to write the unescaped string s to a UTF-8 file?

You would only see the correct output from the UTF-8 encoded print if your interactive session were itself running in a UTF-8 terminal. It is not: if it were, the print encoded as latin1 would have failed instead.
– jsbueno, Commented Oct 4, 2012 at 17:02
To answer the question about the encoding of my session: the output of the locale command confirms that it is UTF-8 (I am on Linux).
– Sébastien, Commented Oct 4, 2012 at 17:19
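
The byte-level reasoning in these comments can be checked directly. The following is a minimal sketch in Python 3 syntax (the question uses Python 2), and it assumes that whatever displays the output interprets incoming bytes as CP1252; that assumption is suggested by the symptoms but is not confirmed in the thread. The variable names are illustrative.

```python
# What unescape("&#139;") returned in the question: the single code point U+008B.
s = "\x8b"

latin1_bytes = s.encode("latin-1")  # one byte: 0x8B
utf8_bytes = s.encode("utf-8")      # two bytes: 0xC2 0x8B

# Simulate a display that decodes the bytes it receives as CP1252 (an assumption).
shown_for_latin1 = latin1_bytes.decode("cp1252")
shown_for_utf8 = utf8_bytes.decode("cp1252")

print(latin1_bytes, ascii(shown_for_latin1))
print(utf8_bytes, ascii(shown_for_utf8))
```

Under that assumption, the latin1 output shows up as U+2039 (‹) only because byte 0x8B happens to be that character in CP1252, while the UTF-8 output shows up as U+00C2 followed by U+2039, which matches the "Â‹" seen in the question.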

Ignacio Vazquez-Abrams · Accepted Answer · 2012-10-04 16:55:47Z

3

U+008B is a control character, therefore seeing nothing is not unusual. "‹" is U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK, and is not even in Latin-1. It is, however, character 0x8B in CP1252. And stop relying on the Windows console output to tell you what's correct or not, unless you run chcp 65001 beforehand.
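
If the input cannot be fixed at the source, one possible workaround is to reinterpret code points in the C1 control range (U+0080 to U+009F) as CP1252 bytes before writing the text out as UTF-8. This is only a sketch in Python 3 syntax, not part of the original answer: `fix_c1_as_cp1252` is a hypothetical helper name, and it assumes that every C1 code point in the data is really a mis-encoded CP1252 character.

```python
import os
import tempfile


def fix_c1_as_cp1252(text):
    """Hypothetical helper: map U+0080..U+009F to their CP1252 meaning."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                # A few bytes in this range are undefined in CP1252; keep them as-is.
                pass
        out.append(ch)
    return "".join(out)


fixed = fix_c1_as_cp1252("\x8b")
print(ascii(fixed))

# Write the repaired text as UTF-8 (a temporary directory is used for the demo).
path = os.path.join(tempfile.mkdtemp(), "out8.txt")
with open(path, "w", encoding="utf-8") as out:
    out.write(fixed)

# Read the raw bytes back to see what actually landed in the file.
with open(path, "rb") as raw:
    written = raw.read()
print(written)
```

After the remapping, the file contains the three-byte UTF-8 sequence for U+2039 rather than the two-byte sequence for the invisible control character U+008B.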


3 Comments

Sébastien Over a year ago

I use a Linux console over SSH. From your explanation, I infer that there is a bug in the unescape() function, which returns U+008B instead of U+2039. Am I wrong?

Ignacio Vazquez-Abrams Over a year ago

The bug is in the data. Or more specifically, in whatever generated the data. It should have used &lsaquo; to encode the character, but instead selfishly assumed that the world revolves around Microsoft and used a character that doesn't exist in the proper specifications.

Sébastien Over a year ago

Nice link, I had not found the full list of HTML entities. You are right, the data violates the specs. Running my test with &lsaquo; gives the expected result.
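
For readers on Python 3, where the thread's `HTMLParser.unescape` is replaced by `html.unescape`, the behavior differs: the standard library follows the HTML5 rules and remaps numeric references in the 0x80 to 0x9F range to their CP1252 characters. The short sketch below is an addition, not something discussed in the thread, and compares the malformed reference with the named and decimal forms.

```python
import html

from_bad_numeric = html.unescape("&#139;")  # the reference used in the question's data
from_named = html.unescape("&lsaquo;")      # the named entity
from_decimal = html.unescape("&#8249;")     # decimal reference for U+2039

print(ascii(from_bad_numeric), ascii(from_named), ascii(from_decimal))
```

All three calls yield U+2039, so data of this kind decodes to the intended character without a manual repair step there.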
