0

As per this thread it's pretty easy to remove all digits from a string in PHP.

For example:

$no_digits = preg_replace('/\d/', '', 'This string contains digits! 1234');

But I don't want to remove digits that are part of HTML character codes such as:

&#41;
&#169;

How can I get the regex to ignore numbers that are part of an HTML character code, i.e. numbers that are sandwiched between &# and ; characters?

  • You'd probably need to include the hex version as well. Commented Aug 3, 2016 at 17:17
  • And you could probably just use the whole entity construct without being specific. '~(?i:[&%](?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)));(*SKIP)(*FAIL)|\d+)~' Commented Aug 3, 2016 at 17:31

4 Answers

3

You can use the (*SKIP)(*F) verbs:

echo preg_replace('/&#\d+;(*SKIP)(*F)|\d+/', '',
                  'This string contains digits! 1234 &#41; &#169; 5678');
//=> This string contains digits!  &#41; &#169; 

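&#\d+;(*SKIP)(*F) makes the engine discard the match and skip past it whenever the regex matches the &#\d+; pattern, so the digits inside an entity never reach the \d+ alternative.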
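
As a rough sketch, this could be wrapped in a small helper. The function name strip_digits_keep_entities is made up for illustration, and the hex branch (x[0-9a-f]+ with the i flag) is an assumption added to cover the hex entities mentioned in the comments under the question; it is not part of the pattern above.

```php
<?php
// Hypothetical helper; the name does not come from any library.
function strip_digits_keep_entities(string $text): string
{
    // First alternative: a whole numeric entity, decimal (&#169;) or hex (&#xA9;).
    // (*SKIP)(*F) fails that match and stops the engine from retrying inside it,
    // so the digits of the entity are never offered to \d+.
    // Second alternative: any other run of digits, which is removed.
    $pattern = '/&#(?:\d+|x[0-9a-f]+);(*SKIP)(*F)|\d+/i';

    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '', $text) ?? $text;
}

echo strip_digits_keep_entities('Digits 1234 &#41; &#xA9; 5678'), PHP_EOL;
//=> Digits  &#41; &#xA9; 
```

Named entities such as &copy; contain no digits, so they need no special handling here.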

Alternatively you can use lookarounds:

echo preg_replace('/(?<!&#)\d+|\d+(?!;)/', '',
                  'This string contains digits! 1234 &#41; &#169; 5678');

This means: match 1 or more digits that are either not preceded by &# OR not followed by ; with the intent of skipping the &#\d+; pattern.
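
A quick check on a bare entity suggests the alternation can still remove digits inside it, because each branch is tested on its own (a comment on another answer here makes the same point). This is only a sketch to illustrate the behaviour; the helper name strip_digits_alternation is hypothetical.

```php
<?php
// Hypothetical helper wrapping the alternation pattern shown above.
function strip_digits_alternation(string $text): string
{
    return preg_replace('/(?<!&#)\d+|\d+(?!;)/', '', $text) ?? $text;
}

// Branch 2 backtracks from "169" to "16" (followed by "9", not ";").
// Branch 1 then matches the leftover "9" (preceded by "16", not "&#").
var_dump(strip_digits_alternation('&#169;'));
//=> string(3) "&#;"
```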

6 Comments

Does PHP see *F as *FAIL?
Yes, *F is just a shortcut for *FAIL.
@sln Indeed, (*F) = (?!) = (*FAIL).
How about (*S)? Is it the same as (*SKIP)?
No, that is not valid in PCRE.
0

You can use

var output = Regex.Replace(input, @"[\d-]", string.Empty);

The \d identifier simply matches any digit character.

1 Comment

That is C# (.NET), but the OP is using PHP. Also, \d matches only a single digit; you need a quantifier to match tens, hundreds, thousands, etc. in one go.
0

As an option, you could convert your string to UTF-8 (if it's not already UTF-8), decode HTML entities to the corresponding characters with html_entity_decode(), and then remove the digits with a regexp. If needed, convert special characters back to entities afterwards with htmlentities(); in UTF-8 it's actually enough to escape just a minimal subset of special characters via htmlspecialchars(). Finally, convert the string back to its original encoding if it was not UTF-8.
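
The steps above could look roughly like this. It is a minimal sketch that assumes the input is already valid UTF-8 (so the encoding conversions are left out); the function name strip_digits_via_decode is hypothetical.

```php
<?php
// Hypothetical helper; assumes $html is already valid UTF-8.
function strip_digits_via_decode(string $html): string
{
    // 1. Turn entities such as &#169; into real characters.
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. No entities are left, so a plain digit pattern is enough now.
    $text = preg_replace('/\d+/', '', $text) ?? $text;

    // 3. Re-escape only the characters that are special in HTML.
    return htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo strip_digits_via_decode('Tom &amp; Jerry 1234 &#169; 5678'), PHP_EOL;
//=> Tom &amp; Jerry  © 
```

Note that the entity comes back as a literal character rather than as &#169;, and an entity that decodes to a digit (for example &#49;, which is "1") would be removed along with the other digits.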

-1

You can use look behind and look ahead.

$no_digits = preg_replace('/(?<!&#)\d+(?=[^;\d])/', '', 'This string contains &#41; digits! 1234');

So basically, (?<!&#) tells the regex engine to look behind \d+ to make sure that it is not preceded by &#, and (?=[^;\d]) tells it to look ahead of \d+ to make sure that the next character is neither a semicolon nor a digit.

I like this solution a bit better, as it can be used with most regex engines, such as those in Java and JavaScript.
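
For comparison, here is a lookaround-only sketch that keeps both conditions tied to the same complete run of digits. The pattern and the function name strip_digits_lookaround are assumptions for illustration, not the pattern from this answer: it first anchors on the start of a digit run with (?<!\d), then removes the run unless it is both preceded by &# and followed by ;.

```php
<?php
// Hypothetical helper; the pattern is an illustrative assumption.
function strip_digits_lookaround(string $text): string
{
    // (?<!\d)            start of a digit run (no digit right before it)
    // (?<!&#)\d+(?!\d)   whole run that is not preceded by "&#"
    // \d+(?![\d;])       whole run that is not followed by ";"
    $pattern = '/(?<!\d)(?:(?<!&#)\d+(?!\d)|\d+(?![\d;]))/';

    return preg_replace($pattern, '', $text) ?? $text;
}

echo strip_digits_lookaround('This string contains &#41; digits! 1234'), PHP_EOL;
//=> This string contains &#41; digits! 
```

With this sketch, 555; becomes ; and &#77x becomes &#x, while &#169; is left untouched.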

Hope this helps.

Edit: I had missed one character, <.

6 Comments

This (?!&#)\d+ segment looks ahead for not &# then matches \d which can never be &#
It is a 'look behind'. The RegEx engine first looks for \d. Once found, it looks behind to see if it is not &# then it will trigger \d found. It is a relative thing. :-p
No it's a lookahead.
Hahaha, I see what you guys mean. I missed a <. Cheers.
OK, so (?<!&#)\d+(?=[^;\d]) runs into the same problem as @anubhava's alternation regex. You've split the condition into two parts that don't work separately. If either one is satisfied on its own but not both together, it will miss a valid \d. Example: it won't match 555; nor &#77x