Convert XML/HTML Entities into Unicode String in Python [duplicate]

Question

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example:

I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. Its code point is the 16-bit hexadecimal value 01ce. I want to convert the HTML entity into the value u'\u01ce'.

related: Decode HTML entities in Python string?

– jfs, Commented Feb 2, 2016 at 8:36

sophros · Accepted Answer · 2020-03-27 07:56:47Z

61

The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

Python 2 (in Python 3 up to 3.3 the module is named html.parser):

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'

Python 3.4+:

import html
html.unescape('&copy; 2010') # '© 2010'
html.unescape('&#169; 2010') # '© 2010'
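
As a minimal sketch, assuming Python 3.4 or newer, the same call also handles the hexadecimal reference from the question. The sample strings and variable names below are illustrative only:

```python
import html

# Hexadecimal, decimal, and named references all decode to str in Python 3.
hex_ref = html.unescape("&#x01ce;")    # reference from the question
dec_ref = html.unescape("&#462;")      # same code point written in decimal
named = html.unescape("&copy; 2010")   # named entity

print(repr(hex_ref), repr(dec_ref), repr(named))
```

All three forms go through one function, so there is no need to detect which kind of reference a scraped string contains.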

edited Mar 27, 2020 at 7:56

sophros


answered Sep 27, 2012 at 5:34

Vladislav


4 Comments

jfs Over a year ago

it also works for hex entities. The implementation is very similar to unescape() function from @dF.'s answer.

Aram Dulyan Over a year ago

This method isn't documented in Python's HTMLParser documentation, and there's a comment in the source stating it's intended for internal use. However, it works like a treat in Python 2.6 through 2.7, and is probably the best solution out there. Prior to version 2.6, it would only decode named entities like &amp; or &gt;.

jfs Over a year ago

It is exposed as html.unescape() function in Python 3.4+

Stan Over a year ago

This raises UnicodeDecodeError with UTF-8 strings. You must either decode('utf-8') it first or use xml.sax.saxutils.unescape.

Stew · Accepted Answer · 2015-09-01 17:12:56Z

60

Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

edited Sep 1, 2015 at 17:12

Stew


answered Sep 12, 2008 at 1:40

dF.


3 Comments

smci Over a year ago

Absolutely. Why is this not in the stdlib?

jnns Over a year ago

Looking at its code, it doesn't seem to work with &amp; and such, does it?

joel.d Over a year ago

Just tested successfully for &amp;

chryss · Accepted Answer · 2008-09-11 23:09:08Z

18

Use the builtin unichr -- BeautifulSoup isn't necessary:

>>> entity = '&#x01ce'
>>> unichr(int(entity[3:],16))
u'\u01ce'
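
A rough Python 3 equivalent, assuming the string holds exactly one hexadecimal reference. unichr no longer exists in Python 3, so chr is used instead; stripping a trailing semicolon is an addition made here for illustration:

```python
entity = "&#x01ce;"

# Drop the "&#x" prefix and any trailing ";", then parse the rest as base 16.
code_point = int(entity[3:].rstrip(";"), 16)
char = chr(code_point)

print(repr(char))
```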

answered Sep 11, 2008 at 23:09

chryss


2 Comments

smci Over a year ago

But that requires you to automatically and unambiguously know where in the string the encoded Unicode characters are, which you can't know. And you need to try...except the resulting exception for when you get it wrong.

Stefan Collier Over a year ago

unichr was removed in Python 3. Any suggestion for that version?

Markus Amalthea Magnuson · Accepted Answer · 2018-07-20 19:35:48Z

18

If you are on Python 3.4 or newer, you can simply use html.unescape:

import html

s = html.unescape(s)

edited Jul 20, 2018 at 19:35

answered Dec 11, 2014 at 14:12

Markus Amalthea Magnuson


pragmar · Accepted Answer · 2012-02-09 19:01:45Z

16

An alternative, if you have lxml:

>>> import lxml.html
>>> lxml.html.fromstring('&#x01ce').text
u'\u01ce'

edited Feb 9, 2012 at 19:01

answered Feb 9, 2012 at 18:55

pragmar


2 Comments

pintoch Over a year ago

Be careful though, because this can also return an object of type str if there is no special character.

Mansoor Akram Over a year ago

best solution when everything fails, only lxml comes to rescue. :)

Community · Accepted Answer · 2017-05-23 11:47:17Z

8

You could find an answer here -- Getting international characters from a web page?

EDIT: It seems like BeautifulSoup doesn't convert entities written in hexadecimal form. It can be fixed:

import copy, re
from BeautifulSoup import BeautifulSoup

hexentityMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
# replace hexadecimal character reference by decimal one
hexentityMassage += [(re.compile('&#x([^;]+);'), 
                     lambda m: '&#%d;' % int(m.group(1), 16))]

def convert(html):
    return BeautifulSoup(html,
        convertEntities=BeautifulSoup.HTML_ENTITIES,
        markupMassage=hexentityMassage).contents[0].string

html = '<html>&#x01ce;&#462;</html>'
print repr(convert(html))
# u'\u01ce\u01ce'

EDIT:

The unescape() function mentioned by @dF, which uses the htmlentitydefs standard module and unichr(), might be more appropriate in this case.

edited May 23, 2017 at 11:47

CommunityBot


answered Sep 11, 2008 at 21:52

jfs


5 Comments

Cristian Over a year ago

This solution doesn't work with the example: print BeautifulSoup('<html>&#x01ce;</html>', convertEntities=BeautifulSoup.HTML_ENTITIES). This returns the same HTML entity.

Martijn Pieters Over a year ago

Note: this only applied to BeautifulSoup 3, deprecated and considered legacy since 2012. BeautifulSoup 4 handles HTML entities like these automatically.

jfs Over a year ago

@MartijnPieters: correct. html.unescape() is a better option on modern Python.

Martijn Pieters Over a year ago

Absolutely. If all you wanted was to decode HTML entities, there is no need to use BeautifulSoup at all.

jfs Over a year ago

@MartijnPieters: on old Python versions, unless HTMLParser.HTMLParser().unescape() hack worked for you, using BeautifulSoup might be a better alternative than defining unescape() by hand (vendoring a pure Python lib vs. a copy-paste of the function).

karlcow · Accepted Answer · 2009-02-21 19:45:58Z

5

This is a function which should help you to get it right and convert entities back to utf-8 characters.

import re, htmlentitydefs

def unescape(text):
   """Removes HTML or XML character references 
      and entities from a text string.
   @param text The HTML (or XML) source text.
   @return The plain text, as a Unicode string, if necessary.
   from Fredrik Lundh
   2008-01-03: input only unicode characters string.
   http://effbot.org/zone/re-sub.htm#unescape-html
   """
   def fixup(m):
      text = m.group(0)
      if text[:2] == "&#":
         # character reference
         try:
            if text[:3] == "&#x":
               return unichr(int(text[3:-1], 16))
            else:
               return unichr(int(text[2:-1]))
         except ValueError:
            print "Value Error"
            pass
      else:
         # named entity
         # reescape the reserved characters.
         try:
            if text[1:-1] == "amp":
               text = "&amp;amp;"
            elif text[1:-1] == "gt":
               text = "&amp;gt;"
            elif text[1:-1] == "lt":
               text = "&amp;lt;"
            else:
               print text[1:-1]
               text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
         except KeyError:
            print "keyerror"
            pass
      return text # leave as is
   return re.sub(r"&#?\w+;", fixup, text)

answered Feb 21, 2009 at 19:45

karlcow


2 Comments

dariopy Over a year ago

Why is this answer modded down? It seems useful to me.

karlcow Over a year ago

because the person wanted the character in unicode instead of utf-8 characters. I guess :)

Balthazar Rouberol · Accepted Answer · 2013-03-14 15:58:17Z

Not sure why the Stack Overflow thread does not include the ';' in the search/replace (i.e. lambda m: '&#%d;'). If you don't, BeautifulSoup can barf because the adjacent character can be interpreted as part of the HTML code (e.g. &#39B for &#39Blackout).

This worked better for me:

import re
from BeautifulSoup import BeautifulSoup

html_string='<a href="/cgi-bin/article.cgi?f=/c/a/2010/12/13/BA3V1GQ1CI.DTL"title="">&#x27;Blackout in a can; on some shelves despite ban</a>'

hexentityMassage = [(re.compile('&#x([^;]+);'),
                     lambda m: '&#%d;' % int(m.group(1), 16))]

soup = BeautifulSoup(html_string,
                     convertEntities=BeautifulSoup.HTML_ENTITIES,
                     markupMassage=hexentityMassage)

The int(m.group(1), 16) call converts the number (given in base 16) back to an integer.
m.group(0) returns the entire match; m.group(1) returns the first regexp capturing group.
Basically, using markupMassage is the same as:
html_string = re.sub('&#x([^;]+);', lambda m: '&#%d;' % int(m.group(1), 16), html_string)
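
The substitution above can be tried on its own, without BeautifulSoup. A minimal sketch in Python 3, where the sample string and the name hex_to_decimal are made up for illustration:

```python
import re

def hex_to_decimal(markup):
    """Rewrite hexadecimal character references (&#x27;) as decimal ones (&#39;)."""
    return re.sub(r'&#x([^;]+);',
                  lambda m: '&#%d;' % int(m.group(1), 16),
                  markup)

sample = "&#x27;Blackout in a can&#x27; on some shelves"
converted = hex_to_decimal(sample)
print(converted)
```

Because the pattern accepts anything up to the next semicolon, a malformed reference with non-hex characters would make int() raise ValueError; real input may need a stricter pattern such as [0-9a-fA-F]+.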

score 1 · Accepted Answer · 2015-11-02 20:41:37Z

1

Another solution is the built-in library xml.sax.saxutils (both for HTML and XML). However, it will convert only &gt;, &amp; and &lt;.

from xml.sax.saxutils import unescape

unescaped_text = unescape(escaped_text)
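
A small sketch of that limitation in Python 3. The optional second argument of unescape is a dictionary of extra replacements; the sample strings and the extra mapping are chosen here for illustration:

```python
from xml.sax.saxutils import unescape

# Only &amp;, &lt; and &gt; are handled by default; &copy; is left untouched.
default_result = unescape("&lt;b&gt; &amp; &copy;")

# Extra entities can be supplied as an {entity: replacement} dictionary.
extended_result = unescape("&copy; 2010", {"&copy;": "\u00a9"})

print(default_result)
print(extended_result)
```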

edited Nov 2, 2015 at 20:41

answered Nov 2, 2015 at 20:28

user3946687


Community · Accepted Answer · 2017-05-23 12:03:02Z

0

Here is the Python 3 version of dF's answer:

import re
import html.entities

def unescape(text):
    """
    Removes HTML or XML character references and entities from a text string.

    :param text:    The HTML (or XML) source text.
    :return:        The plain text, as a Unicode string, if necessary.
    """
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return chr(int(text[3:-1], 16))
                else:
                    return chr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = chr(html.entities.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)

The main changes concern htmlentitydefs, which is now html.entities, and unichr, which is now chr. See this Python 3 porting guide.
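
As a quick sanity check of those two renames, assuming Python 3.4+ so that html.unescape is available for comparison, the lookup table and chr can be exercised directly. The variable names are illustrative, and the note about apos reflects the HTML 4 table that name2codepoint holds:

```python
import html
import html.entities

# name2codepoint maps an entity name (without "&" and ";") to its code point.
code_point = html.entities.name2codepoint["copy"]
char = chr(code_point)

# The hand-written lookup and the standard-library helper agree here.
same = (char == html.unescape("&copy;"))

# "apos" is an XML entity, not an HTML 4 one, so the table lacks it.
has_apos = "apos" in html.entities.name2codepoint

print(code_point, repr(char), same, has_apos)
```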

edited May 23, 2017 at 12:03

CommunityBot


answered Dec 25, 2015 at 13:55

Victor


2 Comments

Martijn Pieters Over a year ago

In Python 3, you'd just use html.unescape(); why have a dog and bark yourself?

Jens Over a year ago

html.entities.entitydefs["apos"] does not exist, and html.unescape('can&apos;t') produces "can't", which uses U+0027 (') instead of the proper U+2019 (’) (or U+02BC, depending on which argument you follow). But I guess that's intended according to the character entity reference.
