Converting HTML entities into their values in Python

Question

I use this regex on some input,

[^a-zA-Z0-9@#]

However, this ends up removing many HTML special characters in the input, such as

#227;, #1606;, #1588; (I had to remove the & prefix so that they wouldn't show up as the actual characters.)

Is there a way to convert them to their values so that they satisfy the regular expression? I also have no idea why the text decided to be so big.

Can you show the regex with some context? Your example above just matches any character that is not one of those characters. Is this part of a larger expression? Are you replacing characters?
– Trey Hunner, Commented May 2, 2010 at 23:32
I formatted your text as code so it doesn't get so big any more ;-).
– Alex Martelli, Commented May 2, 2010 at 23:48
That regular expression you listed simply matches exactly one character that is not a letter, a number, or the symbols @ or #. That seems odd but if that's what you need, the best solution would probably be to convert the special HTML characters using one of the answers given by Alex Martelli or doublep and then run your regular expression to match the single character afterwards. Without knowing the restrictions on your input character set, I cannot tell if my solution to include the special character matching within the regex would actually work well.
– Trey Hunner, Commented May 4, 2010 at 7:55

Alex Martelli · Accepted Answer · 2010-05-03 00:03:35Z

Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes xml entity defs (ampersand, hash, digits, semicolon) to unicode:

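import re

# Match decimal numeric character references such as &#227;
xed_re = re.compile(r'&#(\d+);')

def usub(m):
    # Replace the matched reference with the unicode character it encodes
    return unichr(int(m.group(1)))

s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)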

If your terminal emulator can display arbitrary unicode glyphs, print u will then show

ã, ن, ش

In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
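The snippet above is Python 2 (it relies on unichr). As a minimal sketch of the same two-step idea in Python 3, the following decodes decimal references first and only then applies the regex from the question. The function name, the sample string, and the assumption that the regex is used to delete characters with re.sub are illustrative and not taken from the question.

```python
import re

NUMERIC_ENTITY = re.compile(r'&#(\d+);')  # decimal references only, as in the answer
UNWANTED = re.compile(r'[^a-zA-Z0-9@#]')  # the regex from the question

def decode_numeric_entities(text):
    """Replace each &#NNN; reference with the character it encodes."""
    return NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1))), text)

sample = 'caf&#227; @home #tag!'  # made-up input for illustration

# Without decoding, the '#' and digits of the entity survive as stray text.
print(UNWANTED.sub('', sample))                           # caf#227@home#tag

# After decoding, the accented letter is treated as a single character.
print(UNWANTED.sub('', decode_numeric_entities(sample)))  # caf@home#tag
```

Whether dropping the accented letter entirely is the desired outcome depends on the input, as the answer notes.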

If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
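On Python 3.4 and later, the standard library's html.unescape may be enough on its own, since it decodes decimal, hexadecimal, and named references in one call. This is a sketch of an alternative for newer Python versions, not part of the original answer; the sample strings are made up.

```python
import html

# Decimal numeric references, as in the question.
numeric = html.unescape('&#227;, &#1606;, &#1588;')
print(numeric)  # ã, ن, ش

# A named reference and a hexadecimal reference in the same string.
mixed = html.unescape('x &laquo; y &#x7b; z')
print(mixed)    # x « y { z
```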

user319799 · Answer · score 1 · 2010-05-02 23:54:07Z

You can adapt the following script:

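import htmlentitydefs
import re

def substitute_entity(match):
    # Group 1 is the entity body: a name such as 'laquo', or '#' plus digits
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except (ValueError, OverflowError):
            # Not a decimal number, or not a valid code point
            pass

    # Anything unrecognized is replaced with a placeholder
    return '?'

print re.sub(r'&(#?\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')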

On my machine, this produces the following output:

x « y ? z {

EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
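The script above targets Python 2 (htmlentitydefs, unichr, and the print statement). A rough Python 3 port, assuming the same behavior is wanted, could look like the sketch below. In Python 3 the name table lives in html.entities and chr replaces unichr; the variable name result is introduced here for illustration.

```python
import re
from html.entities import name2codepoint

def substitute_entity(match):
    # Group 1 is the entity body: a name such as 'laquo', or '#' plus digits.
    name = match.group(1)
    if name in name2codepoint:
        return chr(name2codepoint[name])
    if name.startswith('#'):
        try:
            return chr(int(name[1:]))
        except (ValueError, OverflowError):
            pass  # not a decimal number, or outside the Unicode range
    return '?'  # placeholder for anything unrecognized

result = re.sub(r'&(#?\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
print(result)  # x « y ? z {
```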

edited May 2, 2010 at 23:54

answered May 2, 2010 at 23:46

user319799


Trey Hunner · Answer · score 0 · 2010-05-02 23:45:30Z

Without knowing what the expression is being used for I can't tell exactly what you need.

This is meant to match either a run of characters other than letters, digits, @, and #, or an entity-like sequence of the form #...; :

[^a-zA-Z0-9@#]*|#[0-9A-Za-z]+;
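One caveat when trying this with Python's re module: alternation is tried left to right, and because `*` allows the first branch to match the empty string, the `#...;` branch is never reached in a plain search. The sketch below is a hypothetical variant, not the answer's own pattern. It puts the entity branch first, includes the leading & (which the question had stripped for display), and uses + so that no branch can match nothing. The sample string is made up.

```python
import re

# Entity branch first, then runs of other characters outside the allowed set.
pattern = re.compile(r'&#?[0-9A-Za-z]+;|[^a-zA-Z0-9@#]+')

text = 'caf&#227; @home, ok!'
matches = pattern.findall(text)
print(matches)  # ['&#227;', ' ', ', ', '!']
```

With this ordering the whole entity is matched as one unit instead of being split around its `#` and digits.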

answered May 2, 2010 at 23:45

Trey Hunner
