Convert html entities to ascii in Python

Question

I need to convert any html entity into its ASCII equivalent using Python. My use case is that I am cleaning up some HTML used to build emails to create plaintext emails from the HTML.

Right now, I only really know how to create Unicode from these entities, when what I need (I think) is ASCII, so that the plaintext email reads correctly with things like accented characters. I think a basic example is the HTML entity "&aacute;" (á) being encoded into ASCII.

Furthermore, I'm not even 100% sure that ASCII is what I need for a plaintext email. As you can tell, I'm completely lost on this encoding stuff.

Community · Accepted Answer · 2012-02-21 11:58:39Z

8

Here is a complete implementation that also handles unicode html entities. You might find it useful.

It returns a unicode string, which is not ASCII. If you want plain ASCII, you can modify the replace operations so that they replace the entities with an empty string.

# Python 2 code; on Python 3, use html.entities and chr() instead.
import re
import htmlentitydefs

def convert_html_entities(s):
    # Decimal character references, e.g. &#225;
    for hit in set(re.findall(r"&#\d+;", s)):
        try:
            s = s.replace(hit, unichr(int(hit[2:-1])))
        except ValueError:
            # Code point out of range for unichr: leave the reference as-is.
            pass

    # Hexadecimal character references, e.g. &#xE1;
    for hit in set(re.findall(r"&#[xX][0-9a-fA-F]+;", s)):
        try:
            s = s.replace(hit, unichr(int(hit[3:-1], 16)))
        except ValueError:
            pass

    # Named entities, e.g. &aacute;. &amp; is handled last so that text such
    # as "&amp;lt;" decodes to "&lt;" rather than "<".
    hits = set(re.findall(r"&\w+;", s))
    amp = "&amp;"
    hits.discard(amp)
    for hit in hits:
        name = hit[1:-1]
        if name in htmlentitydefs.name2codepoint:
            s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
    s = s.replace(amp, "&")
    return s

Edit: added matching for hex codes. I've been using this for a while now, and ran into my first situation with &#x27;, which is a single quote/apostrophe.
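If plain ASCII really is required, one possible approach on Python 3 is to decode the entities and then strip the accents, which is a different technique from editing the replace calls above. This is only a sketch under stated assumptions: the name entities_to_ascii is made up for illustration, it relies on the standard library's html.unescape instead of the function above, and any character without an ASCII base letter (the euro sign, for example) is silently dropped.

```python
import html
import unicodedata


def entities_to_ascii(text):
    """Decode HTML entities, then reduce the result to plain ASCII.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # html.unescape handles named, decimal, and hexadecimal references.
    decoded = html.unescape(text)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Ignoring non-ASCII characters drops the combining marks and anything
    # else that has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(entities_to_ascii("caf&eacute; &amp; cr&#232;me &#x27;br&ucirc;l&eacute;e&#x27;"))
# prints: cafe & creme 'brulee'
```

Whether losing characters this way is acceptable depends on the mail content; keeping Unicode and encoding it as UTF-8 avoids the loss entirely.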

edited Feb 21, 2012 at 11:58

CommunityBot

answered Oct 17, 2009 at 11:53

agazso

1 Comment

Andrew Over a year ago

Nice answer. It seems there should be something in a standard module to do this.
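For readers on Python 3.4 or later, the standard library does cover this: html.unescape decodes named, decimal, and hexadecimal references in a single call. A minimal sketch, with sample strings chosen purely for illustration:

```python
import html

# Illustrative inputs covering the three reference styles plus &amp;.
samples = ["&aacute;", "&#225;", "&#xE1;", "Tom &amp; Jerry", "&#x27;quoted&#x27;"]

# html.unescape returns a str with every recognised reference decoded.
decoded = [html.unescape(s) for s in samples]

for before, after in zip(samples, decoded):
    print(repr(before), "->", repr(after))
```

The first three samples all decode to the same character, á (U+00E1).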

Alex Martelli · Accepted Answer · 2009-07-29 04:10:44Z

2

ASCII is the American Standard Code for Information Interchange and does not include any accented letters. Your best bet is to get Unicode (as you say you can) and encode it as UTF-8 (or maybe ISO-8859-1 or some weird codepage if you're dealing with seriously badly coded user-agents/clients, sigh). The Content-Type header of that part, together with text/plain, can express which encoding you've chosen. I do recommend trying UTF-8 unless you have positively demonstrated that it cannot work: it's almost universally supported these days and MUCH more flexible than any ISO-8859 or "codepage" hack!
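As a rough illustration of that advice on Python 3, the sketch below decodes the entities and builds a text/plain part declared as UTF-8. The function name build_plaintext_message and the example.com addresses are placeholders invented for this example; html.unescape and email.message.EmailMessage are standard library.

```python
import html
from email.message import EmailMessage


def build_plaintext_message(body_with_entities, subject, sender, recipient):
    """Hypothetical helper: decode entities and wrap the text as UTF-8 mail."""
    # Turn entities such as &aacute; into real Unicode characters.
    body = html.unescape(body_with_entities)
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    # Declares Content-Type text/plain with charset utf-8 for this part.
    msg.set_content(body, subtype="plain", charset="utf-8")
    return msg


msg = build_plaintext_message(
    "Ol&aacute;, caf&eacute; costs &euro;2",
    "Hello",
    "sender@example.com",
    "recipient@example.com",
)
print(msg.get_content_type(), msg.get_content_charset())
```

The mail client then reads the charset from the header instead of guessing, so accented letters and the euro sign survive.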

answered Jul 29, 2009 at 4:10

Alex Martelli

1 Comment

aezell Over a year ago

This worked well moving stuff to unicode first and then on to UTF-8, at least in my preliminary tests. We'll have to send out some mail tomorrow and see how it plays in the email clients. Thanks for the detailed explanation as to what I might really want as well.

ars · Accepted Answer · 2009-07-29 04:10:45Z

1

You can use the htmlentitydefs module (Python 2):

import htmlentitydefs
print htmlentitydefs.entitydefs['aacute']

Basically, entitydefs is just a dictionary, and you can see this by printing it at the Python prompt:

from pprint import pprint
pprint(htmlentitydefs.entitydefs)

answered Jul 29, 2009 at 4:10

ars

3 Comments

Alex Martelli Over a year ago

That gives you (roughly) ISO-8859-1 codes, which these days is a hopelessly obsolete approach even for a stubborn "western/european supremacist" (for example, even the euro sign doesn't fit there...!!!). htmlentitydefs.name2codepoint, which uniformly gives you the numeric codepoint (which you can turn into a unicode string of length 1 with unichr, and then .encode as you wish), is vastly preferable.
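A small Python 3 sketch of the name2codepoint route, assuming the Python 3 names html.entities.name2codepoint and chr in place of htmlentitydefs and unichr. The helper entity_to_text is invented here for illustration.

```python
from html.entities import name2codepoint


def entity_to_text(name):
    """Hypothetical helper: map an entity name such as 'euro' to its character."""
    # name2codepoint maps the entity name to an integer code point.
    return chr(name2codepoint[name])


euro = entity_to_text("euro")
print(hex(name2codepoint["euro"]))  # 0x20ac
print(euro.encode("utf-8"))         # b'\xe2\x82\xac'

# The euro sign has no ISO-8859-1 representation, so encoding fails.
try:
    euro.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    fits_latin1 = False
print(fits_latin1)  # False
```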

aezell Over a year ago

Thanks for the input, ars. It certainly seems to get me the ASCII-ish stuff that I asked for, but I also needed a little more guidance on my use case.

ars Over a year ago

@aezell: Sure thing. I prefer Alex's answer, too, for the big picture, as well as his specific suggestions to my own answer. :)

mrjf · Accepted Answer · 2010-08-07 05:56:44Z

0

We put up a little module with agazso's function:

http://github.com/ARTFL/util/blob/master/ents.py

We find agazso's function to be faster than the alternatives for entity conversion. Thanks for posting it.

answered Aug 7, 2010 at 5:56

mrjf
