How to Convert Extended ASCII to HTML Entity Names in Python?

Question

I'm currently doing this to replace extended-ascii characters with their HTML-entity-number equivalents:

s.encode('ascii', 'xmlcharrefreplace')

What I would like to do is convert to the HTML-entity-name equivalent (i.e. &copy; instead of &#169;). The small program below shows what I'm trying to do and how it fails. Is there a way to do this, aside from doing a find/replace?

#coding=latin-1

def convertEntities(s):
    return s.encode('ascii', 'xmlcharrefreplace')

ok = 'ascii: !@#$%^&*()<>'
not_ok = u'extended-ascii: ©®°±¼'

ok_expected = ok
not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'

ok_2 = convertEntities(ok)
not_ok_2 = convertEntities(not_ok)

if ok_2 == ok_expected:
    print 'ascii worked'
else:
    print 'ascii failed: "%s"' % ok_2

if not_ok_2 == not_ok_expected:
    print 'extended-ascii worked'
else:
    print 'extended-ascii failed: "%s"' % not_ok_2
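
For reference, here is a minimal sketch of what the `xmlcharrefreplace` error handler produces. It assumes Python 3 (the code above is Python 2), where `encode` returns bytes that have to be decoded again. The variable names are illustrative only.

```python
# Python 3 sketch: xmlcharrefreplace yields numeric character references.
# The escapes below are the characters (c), (R), degree, plus-minus and 1/4.
text = 'extended-ascii: \xa9\xae\xb0\xb1\xbc'
numeric = text.encode('ascii', 'xmlcharrefreplace').decode('ascii')
print(numeric)  # extended-ascii: &#169;&#174;&#176;&#177;&#188;
```

These numeric references are what the question wants replaced by names such as &copy;.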

Wai Yip Tung · Accepted Answer · 2010-07-22 20:01:49Z

2

Is htmlentitydefs what you want?

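import htmlentitydefs
# c is a single character; this yields the bare entity name (e.g. 'copy'), or c itself if there is none
htmlentitydefs.codepoint2name.get(ord(c), c)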
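
A minimal sketch of how that lookup could be wrapped into a full entity reference. It assumes Python 3, where `htmlentitydefs` was renamed to `html.entities`; the helper name `entity_for` is hypothetical.

```python
from html.entities import codepoint2name

def entity_for(c):
    """Return '&name;' if the character has a named entity, else the character."""
    name = codepoint2name.get(ord(c))
    return '&%s;' % name if name else c

print(entity_for('\xa9'))  # &copy;
print(entity_for('x'))     # x
```

Note that `codepoint2name` also covers ASCII characters such as `<` and `&`, so a guard on the code point is needed if those should pass through unchanged.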

answered Jul 22, 2010 at 20:01

1 Comment

Jason Coon Over a year ago

yes that is what I'm looking for.... not quite there yet though. I think I have a solution based on this

Wayne Werner · Accepted Answer · 2010-07-22 20:09:23Z

2

edit

Others have mentioned the htmlentitydefs module, which I never knew about. It would work with my code this way:

from htmlentitydefs import entitydefs as symbols

for tag, val in symbols.iteritems():
    mystr = mystr.replace("&{0};".format(tag), val)

And that should work.

edited Jul 22, 2010 at 20:09

answered Jul 22, 2010 at 20:00

4 Comments

Jason Coon Over a year ago

That's why I said "aside from a find/replace", in other words, I don't want to build a dictionary of 128 characters. This solution would work for the code I posted though

Wayne Werner Over a year ago

Well I just adapted my code to use htmlentitydefs that others have mentioned. Now you don't have to build it :)

Jason Coon Over a year ago

looks better... might need to add some checking for ASCII codes, since I don't want "<" to get replaced with &lt;

Wayne Werner Over a year ago

entitydefs.get('<', False) => False - it's only a one way replacement: for e in entitydefs: print e to see all the tags.
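
The point about direction can be checked with a short sketch. It assumes Python 3's `html.entities` (the renamed `htmlentitydefs`): `entitydefs` is keyed by entity name, while `codepoint2name` is keyed by code point.

```python
from html.entities import entitydefs, codepoint2name

# entitydefs maps entity names to replacement text, so characters are not keys.
print('<' in entitydefs)         # False
print(entitydefs['lt'])          # <
# codepoint2name goes the other way: code point -> entity name.
print(codepoint2name[ord('<')])  # lt
```

A loop that replaces "&name;" with the value from `entitydefs` therefore turns references back into characters; going from characters to names needs the reverse lookup.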

SiggyF · Accepted Answer · 2010-07-22 20:03:43Z

1

I'm not sure how directly but I think the htmlentitydefs module will be of use. An example can be found here.

answered Jul 22, 2010 at 20:03

Jason Coon · Accepted Answer · 2010-07-23 15:35:55Z

1

Update: This is the solution I'm going with, with a small fix to check that htmlentitydefs contains a mapping for the character we have.

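import htmlentitydefs

def convertEntities(s):
    # Map each character through getEntity and rejoin the pieces.
    return ''.join([getEntity(c) for c in s])

def getEntity(c):
    ord_c = ord(c)
    # Only replace non-ASCII characters that have a named entity.
    if ord_c > 127 and ord_c in htmlentitydefs.codepoint2name:
        return "&%s;" % htmlentitydefs.codepoint2name[ord_c]
    return c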
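
The same rule (only non-ASCII characters that have a named entity) could also be expressed as a translation table. This is a sketch of an alternative technique rather than the code above: it assumes Python 3 and `html.entities`, and the names `ENTITY_TABLE` and `convert_entities` are hypothetical.

```python
from html.entities import codepoint2name

# Translation table: non-ASCII code point -> '&name;'.
# ASCII entries such as '<' and '&' are deliberately left out.
ENTITY_TABLE = {cp: '&%s;' % name
                for cp, name in codepoint2name.items() if cp > 127}

def convert_entities(s):
    # str.translate accepts a dict mapping code points to replacement strings.
    return s.translate(ENTITY_TABLE)

print(convert_entities('extended-ascii: \xa9\xae\xb0\xb1\xbc'))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

Characters without a named entity in the table pass through unchanged, as in the per-character version.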

edited Jul 23, 2010 at 15:35

answered Jul 22, 2010 at 20:13

Duncan · Accepted Answer · 2010-07-22 20:43:54Z

Are you sure that you don't want the conversion to be reversible? Your ok_expected string indicates you don't want existing & characters escaped, so the conversion will be one way. The code below assumes that & should be escaped, but just remove the cgi.escape if you really don't want that.

Anyway, I'd combine your original approach with a regular expression substitution: do the encoding as before and then just fix up the numeric entities. That way you don't end up mapping every single character through your getEntity function.

#coding=latin-1
import cgi
import re
import htmlentitydefs

def replace_entity(match):
    c = int(match.group(1))
    name = htmlentitydefs.codepoint2name.get(c, None)
    if name:
        return "&%s;" % name
    return match.group(0)

def convertEntities(s):
    s = cgi.escape(s) # Remove if you want ok_expected to pass!
    s = s.encode('ascii', 'xmlcharrefreplace')
    s = re.sub("&#([0-9]+);", replace_entity, s)
    return s

ok = 'ascii: !@#$%^&*()<>'
not_ok = u'extended-ascii: ©®°±¼'

ok_expected = ok
not_ok_expected = u'extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;'

ok_2 = convertEntities(ok)
not_ok_2 = convertEntities(not_ok)

if ok_2 == ok_expected:
    print 'ascii worked'
else:
    print 'ascii failed: "%s"' % ok_2

if not_ok_2 == not_ok_expected:
    print 'extended-ascii worked'
else:
    print 'extended-ascii failed: "%s"' % not_ok_2
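
For readers on Python 3, the encode-then-fix-up approach might look like the sketch below. It rests on a few assumptions: `html.escape` stands in for `cgi.escape` (which is not available in current Python 3 releases), `html.entities` replaces `htmlentitydefs`, and the `escape` parameter is a hypothetical switch standing in for "remove the escape call".

```python
import html
import re
from html.entities import codepoint2name

def replace_entity(match):
    # Swap a numeric reference for its name when one exists.
    name = codepoint2name.get(int(match.group(1)))
    return '&%s;' % name if name else match.group(0)

def convert_entities(s, escape=True):
    if escape:
        # quote=False escapes only &, < and >.
        s = html.escape(s, quote=False)
    # encode() returns bytes in Python 3, so decode back to str.
    s = s.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    return re.sub(r'&#([0-9]+);', replace_entity, s)

print(convert_entities('extended-ascii: \xa9\xae\xb0\xb1\xbc'))
# extended-ascii: &copy;&reg;&deg;&plusmn;&frac14;
```

Escaping has to happen before encoding; otherwise the ampersands of the generated references would themselves be escaped.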
