2

I apply this regex to some input:

[^a-zA-Z0-9@#]

However, this ends up removing lots of HTML special characters within the input, such as

#227;, #1606;, #1588; (I had to remove the & prefix so that they wouldn't show up as the actual characters.)

Is there a way to convert them to the characters they stand for, so that the regular expression handles them properly? I also have no idea why the text decided to be so big.

5
  • You could have replaced the & prefix with &amp; Commented May 2, 2010 at 23:29
  • Can you show the regex with some context? Your example above just matches any character that is not one of those characters. Is this part of a larger expression? Are you replacing characters? Commented May 2, 2010 at 23:32
  • I formatted your text as code so it doesn't get so big any more;-). Commented May 2, 2010 at 23:48
  • Trey, the regexp shown is the only regex I'm using. Commented May 3, 2010 at 1:23
  • That regular expression you listed simply matches exactly one character that is not a letter, a number, or the symbols @ or #. That seems odd but if that's what you need, the best solution would probably be to convert the special HTML characters using one of the answers given by Alex Martelli or doublep and then run your regular expression to match the single character afterwards. Without knowing the restrictions on your input character set, I cannot tell if my solution to include the special character matching within the regex would actually work well. Commented May 4, 2010 at 7:55

3 Answers

4

Given that your text appears to have numeric-coded, not named, entities, you can first convert your byte string that includes XML numeric character references (ampersand, hash, digits, semicolon) to Unicode:

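import re

# Matches decimal character references such as &#227;
xed_re = re.compile(r'&#(\d+);')

def usub(m):
    # Python 2: unichr turns the code point into a unicode character
    return unichr(int(m.group(1)))

s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)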

If your terminal emulator can display arbitrary Unicode glyphs, print u will then show

ã, ن, ش

In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ASCII letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want: why only ASCII letters and not accented ones, for example? But if it is what you want, it will work.)
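For readers on Python 3, here is a minimal sketch of the same two steps (decode first, then filter). It assumes Python 3, where chr replaces unichr. The names decode_numeric_entities and unicode_aware are made up for illustration, and the \w-based pattern is only one possible answer to the "why not accented letters" question above, not something the question asked for.

```python
import re

# Matches decimal numeric character references such as "&#227;".
NUMERIC_ENTITY = re.compile(r'&#(\d+);')

def decode_numeric_entities(text):
    """Replace each decimal reference with the character it encodes."""
    return NUMERIC_ENTITY.sub(lambda m: chr(int(m.group(1))), text)

decoded = decode_numeric_entities('&#227;, &#1606;, &#1588;')

# The pattern from the question: anything except ASCII letters, digits, @ and #.
ascii_only = re.compile(r'[^a-zA-Z0-9@#]')

# Hypothetical Unicode-aware variant (an assumption, not from the question):
# \w also matches non-ASCII letters and digits, plus the underscore.
unicode_aware = re.compile(r'[^\w@#]')

ascii_stripped = ascii_only.sub('', decoded)       # drops the decoded letters too
unicode_stripped = unicode_aware.sub('', decoded)  # keeps them, drops ", "
```

Note that the original ASCII-only pattern removes the decoded letters entirely, so decoding alone only helps if the character class is widened as well.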

If you have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer. (Its entitydefs dictionary only gives replacement text in Latin-1; its name2codepoint dictionary maps each named entity to a Unicode code point.)


1

You can adapt the following script:

import htmlentitydefs
import re

def substitute_entity(match):
    name = match.group(1)
    # Named entity such as &laquo;
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    # Decimal numeric reference such as &#123;
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except (ValueError, OverflowError):
            pass

    # Unknown or malformed entity
    return '?'

print re.sub(r'&(#?\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')

This produces the following output here:

x « y ? z {
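In case it helps anyone on Python 3, here is a rough port of the same idea. It assumes Python 3.4 or later, where htmlentitydefs became html.entities and the standard library also offers html.unescape. The variable names are only for illustration.

```python
import re
from html import unescape
from html.entities import name2codepoint

# Same pattern as the script above: a named or decimal numeric entity.
ENTITY = re.compile(r'&(#?\w+);')

def substitute_entity(match):
    name = match.group(1)
    # Named entity such as &laquo;
    if name in name2codepoint:
        return chr(name2codepoint[name])
    # Decimal numeric reference such as &#123;
    if name.startswith('#'):
        try:
            return chr(int(name[1:]))
        except (ValueError, OverflowError):
            pass
    # Unknown or malformed entity
    return '?'

sample = 'x &laquo; y &wat; z &#123;'
manual = ENTITY.sub(substitute_entity, sample)

# The standard library helper decodes known entities and leaves
# unknown ones such as &wat; untouched instead of replacing them.
builtin = unescape(sample)
```

The hand-written version is only worth keeping if you want control over what happens to unknown entities; otherwise html.unescape is probably the simpler choice.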

EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)


0

Without knowing what the expression is being used for, I can't tell exactly what you need.

This will match either a run of characters other than letters, digits, @, and #, or an entity-like token made of #, alphanumerics, and a trailing semicolon:

[^a-zA-Z0-9@#]*|#[0-9A-Za-z]+;
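If the goal is to strip disallowed characters while leaving entities intact, one way to use an alternation like this is a single pass with a replacement function. The sketch below is an assumption-laden variant, not the exact expression above: it puts the entity alternative first, includes the & prefix, accepts only decimal digits, and matches one disallowed character at a time so there are no empty matches. It is written for Python 3, and the names TOKEN and keep_entities are made up for illustration.

```python
import re

# Assumed variant of the pattern above: the entity alternative comes first
# so that "&#233;" is consumed whole before the character class sees "&".
TOKEN = re.compile(r'&#([0-9]+);|[^a-zA-Z0-9@#]')

def keep_entities(match):
    # Group 1 is set only when the numeric-entity alternative matched.
    if match.group(1) is not None:
        return chr(int(match.group(1)))
    # Any other disallowed character is dropped.
    return ''

cleaned = TOKEN.sub(keep_entities, 'caf&#233; @home, #1!')
```

Here the decoded character survives because it is produced by the replacement function and never passes through the character class itself.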
