Issue with converting HTML entities and encoding

Question

I'm using this function to unescape HTML entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub("&#?\w+;", fixup, text)

When I process some text, most of it works, but for some input Python raises this error:

File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
  return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xae' in position 348: character maps to <undefined>

I have tried encoding the text string many different ways (ASCII, UTF, unicode) and nothing has worked so far. I don't really understand how these encodings fit together.

Please post the repr of the offending text so we can reproduce the problem.
– unutbu, Commented Dec 28, 2011 at 12:54
Your code seems to assume that the document is ASCII-encoded using these references to represent other characters. Not all HTML documents will conform to this.
– wberry, Commented Dec 28, 2011 at 15:29

DRH · Accepted Answer · 2011-12-28 20:20:12Z

1

Based on the error message, it looks like you may be attempting to convert a unicode string into CP 437 (an IBM PC character set). This doesn't appear to be occurring in your function, but could happen when attempting to print the resulting string to your console. I ran a quick test with the input string "&#xae; some text" and was able to reproduce the failure when printing the resulting string:

print unescape("&#xae; some text")

You can avoid this by specifying the encoding you want to convert the unicode string to:

print unescape("&#xae; some text").encode('utf-8')

If you print this UTF-8 encoded string to the console, you will likely see garbled non-ASCII characters, since a CP 437 console does not interpret the bytes as UTF-8. However, if you write it to a file and open it in a viewer that supports UTF-8 encoded documents, you should see the characters you expect.
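The file-based route can be sketched as follows. This is a minimal example, not part of the original answer: it assumes `io.open` (available in Python 2.6+ and Python 3), and the helper names `write_utf8` and `read_utf8` and the file name `unescaped.txt` are hypothetical.

```python
import io
import os
import tempfile


def write_utf8(path, text):
    """Write a unicode string to `path`, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_utf8(path):
    """Read `path` back as a unicode string, decoding from UTF-8."""
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


# u"\xae" is U+00AE (registered sign), the character named in the traceback.
sample = u"\xae some text"

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.gettempdir(), "unescaped.txt")

write_utf8(path, sample)
round_tripped = read_utf8(path)
```

Letting `io.open` do the encoding keeps the string as unicode inside the program and converts to bytes only at the file boundary, so the console encoding never gets involved.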

answered Dec 28, 2011 at 20:20

DRH


John Machin · Accepted Answer · 2011-12-28 20:20:08Z

0

You need to post the FULL traceback so that we can see where in YOUR code the error happens. You also need to show us repr(a SMALL piece of data that has this problem) -- your data is at least 348 bytes long.

Based on the initially-supplied information:

You are crashing trying to encode a unicode character using cp437 ...

Either (1) the error is happening somewhere in your displayed code and somebody has kludged your default encoding to be cp437 (don't do that)

or (2) the error is not happening anywhere in the code that you have shown us, it is happening when you try to print some of the results of your function, you are running in a Windows "Command Prompt" window, and so your sys.stdout.encoding is set to some legacy MS-DOS encoding which doesn't support the U+00AE character.
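The second scenario can be checked without a console at all, by asking the codec directly. This is a minimal sketch; the helper name `can_encode` is hypothetical and not from the question.

```python
import sys

REGISTERED_SIGN = u"\xae"  # U+00AE, the character named in the traceback


def can_encode(text, encoding):
    """Return True if every character in `text` can be encoded in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# cp437 has no code point for U+00AE, which matches the reported error.
cp437_ok = can_encode(REGISTERED_SIGN, "cp437")

# UTF-8 can represent every unicode character.
utf8_ok = can_encode(REGISTERED_SIGN, "utf-8")

# The encoding the interpreter uses when printing.
# It can be None, for example when output is piped in Python 2.
console_encoding = getattr(sys.stdout, "encoding", None)
```

If `console_encoding` turns out to be `cp437` (or another legacy code page) on your machine, that would be consistent with the failure happening at print time rather than inside `unescape`.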

edited Dec 28, 2011 at 20:20

answered Dec 28, 2011 at 20:12

John Machin


Sirko · Accepted Answer · 2013-08-23 11:21:34Z

0

You need to convert the result with the encode method, using an encoding such as 'utf-8'. For example:

strdata = result.encode('utf-8')

print strdata
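If the goal is only to print without crashing, one alternative (an assumption on my part, not something the answers propose) is to replace the characters the console cannot show. A minimal sketch, with the hypothetical helper name `console_safe`:

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters missing from `encoding` replaced by '?'.

    When `encoding` is not given, fall back to sys.stdout.encoding,
    and to ASCII if that is unavailable.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # Encode with the 'replace' error handler, then decode back to unicode
    # so the result can be printed like any other string.
    return text.encode(encoding, "replace").decode(encoding)


# cp437 lacks U+00AE, so the registered sign is replaced.
safe = console_safe(u"\xae some text", "cp437")
```

This loses information (the unsupported character becomes `?`), so it suits diagnostic output better than data you intend to keep.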

edited Aug 23, 2013 at 11:21

Sirko


answered Aug 23, 2013 at 11:05

Milind Morey
