PHP regex remove all digits except character codes

Question

As per this thread it's pretty easy to remove all digits from a string in PHP.

For example:

$no_digits = preg_replace('/\d/', '', 'This string contains digits! 1234');

But I don't want to remove digits that are part of HTML character codes such as:

&#41;
&#169;

How can I get the regex to ignore numbers that are part of an HTML character code, i.e. numbers that are sandwiched between &# and ; characters?

And you could probably just use the whole entity construct without being specific: '~(?i:[&%](?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)));(*SKIP)(*FAIL)|\d+)~'
– user557597, Commented Aug 3, 2016 at 17:31

anubhava · Accepted Answer · 2016-08-03 17:07:27Z

3

You can use (*SKIP)(*F) verb:

echo preg_replace('/&#\d+;(*SKIP)(*F)|\d+/', '', 
                  'This string contains digits! 1234 &#41; &#169; 5678');
//=> This string contains digits!  &#41; &#169;

&#\d+;(*SKIP)(*F) makes the engine skip past any text that matches the &#\d+; pattern, so the digits inside a character code never reach the \d+ branch.
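To see how the two branches interact on trickier input, here is a minimal sketch that wraps the same pattern in a function. The name strip_digits_keep_entities is made up for illustration, and the sample string is an assumption chosen to include a digit run followed by a semicolon (555;) and an incomplete entity (&#77x).

```php
<?php
// Hypothetical helper name; the pattern itself is the one from the answer.
function strip_digits_keep_entities(string $text): string
{
    // Left branch: consume a whole decimal entity such as &#169;, then
    // (*SKIP)(*F) discards that match and resumes scanning after the ";".
    // Right branch: any other run of digits is matched and removed.
    return preg_replace('/&#\d+;(*SKIP)(*F)|\d+/', '', $text);
}

echo strip_digits_keep_entities('Call 555; see &#77x and &#41; &#169; 5678'), "\n";
//=> Call ; see &#x and &#41; &#169; 
```

Only complete &#\d+; sequences are protected: 555; and the 77 in &#77x are still treated as plain digits and removed.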

Alternatively you can use lookarounds:

echo preg_replace('/(?<!&#)\d+|\d+(?!;)/', '',
                  'This string contains digits! 1234 &#41; &#169; 5678');

The intent is to match one or more digits that are either not preceded by &# or not followed by ;, so that the &#\d+; pattern is skipped.
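One caveat worth checking before relying on the alternation: each branch tests only one side of the digit run, and \d+ can backtrack, so a branch may match just part of an entity's number (for example the 4 in &#41;). The sketch below is an adaptation, not the answer's own regex. It assumes PCRE possessive quantifiers (\d++) and uses a hypothetical function name, so that a run is always taken whole and is kept only when it is both preceded by &# and followed by ;.

```php
<?php
// Hypothetical helper; the pattern is an adaptation, not the answer's regex.
function strip_digits_lookaround(string $text): string
{
    // (?<!\d)  start only at the first digit of a run
    // \d++     possessive: take the whole run and never give digits back
    // Branch 1 removes runs not preceded by "&#".
    // Branch 2 removes runs not followed by ";".
    // A run survives only if it fails both branches, i.e. it is a
    // complete decimal entity number such as the 169 in &#169;.
    $pattern = '/(?<!\d)(?:(?<!&#)\d++|\d++(?!;))/';
    return preg_replace($pattern, '', $text);
}

echo strip_digits_lookaround('This string contains digits! 1234 &#41; &#169; 5678'), "\n";
//=> This string contains digits!  &#41; &#169; 
```

Unlike the (*SKIP)(*F) version, this form avoids backtracking control verbs, but possessive quantifiers are still not available in every regex engine.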

answered Aug 3, 2016 at 17:07

anubhava

6 Comments

user557597 Over a year ago

Does PHP treat (*F) as (*FAIL)?

anubhava Over a year ago

Yes, (*F) is just shorthand for (*FAIL).

Jan Over a year ago

@sln Indeed, (*F) = (?!) = (*FAIL).

user557597 Over a year ago

What about (*S)? Is it the same as (*SKIP)?

anubhava Over a year ago

No, that is not valid in PCRE.

chris85 · Accepted Answer · 2016-08-03 17:37:37Z

0

You can use

var output = Regex.Replace(input, @"[\d-]", string.Empty);

The \d identifier simply matches any digit character.

edited Aug 3, 2016 at 17:37

chris85

answered Aug 3, 2016 at 17:24

Tijo John

1 Comment

chris85 Over a year ago

That is JavaScript; the OP is using PHP. Also, \d only matches a single digit (0-9), so you need a quantifier for tens, hundreds, thousands, etc.

Marat Tanalin · Accepted Answer · 2016-08-03 21:53:33Z

0

As an option, you could convert your code to UTF-8 encoding (if it’s not already UTF-8), then convert HTML entities to corresponding characters with html_entity_decode(), then remove numbers with a regexp, then, if needed, convert special characters to corresponding entities again with htmlentities() (in UTF-8, it’s actually enough to escape just a minimal subset of special characters via htmlspecialchars()), then convert code back to your original encoding (if the original string was not in UTF-8).
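The steps above can be sketched roughly as follows, assuming the input is already UTF-8 so the encoding conversions at either end can be skipped. The function name is hypothetical. Two side effects are worth noting: an entity that decodes to a digit (such as &#49;) is removed along with ordinary digits, and characters like © come back as literal UTF-8 characters rather than as their original numeric codes.

```php
<?php
// Hypothetical helper; assumes $html is already UTF-8 encoded.
function strip_digits_via_decode(string $html): string
{
    // 1. Turn entities such as &#41; and &#169; into the characters they stand for.
    $decoded = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // 2. Remove every remaining run of ASCII digits.
    $stripped = preg_replace('/\d+/', '', $decoded);

    // 3. Re-escape only the minimal set of special characters (& < > " ').
    return htmlspecialchars($stripped, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo strip_digits_via_decode('This string contains digits! 1234 &#41; &#169; 5678'), "\n";
```

If the original numeric codes must survive byte for byte, a regex that leaves entities untouched is the safer route.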

answered Aug 3, 2016 at 21:53

Marat Tanalin

NawaMan · Accepted Answer · 2016-08-03 18:09:15Z

-1

You can use a lookbehind and a lookahead.

$no_digits = preg_replace('/(?<!&#)\d+(?=[^;\d])/', '', 'This string contains &#41; digits! 1234');

So basically, (?<!&#) tells the regex engine to look behind \d+ to make sure it is not preceded by &#, and (?=[^;\d]) tells it to look ahead of \d+ to make sure the next character is neither a semicolon nor a digit.

I like this solution a bit better as it can be used in most regex engines, such as those in Java and JavaScript.

Hope this helps.

Edit: I was missing one character, <.

edited Aug 3, 2016 at 18:09

answered Aug 3, 2016 at 17:37

NawaMan

6 Comments

user557597 Over a year ago

This (?!&#)\d+ segment looks ahead to check that the next characters are not &# and then matches \d, which can never be &# anyway.

NawaMan Over a year ago

It is a lookbehind. The regex engine first looks for \d. Once found, it looks behind to check that the preceding text is not &#, and only then reports \d as a match. It is a relative thing. :-p

Casimir et Hippolyte Over a year ago

No it's a lookahead.

NawaMan Over a year ago

Hahaha, I see what you guys mean. I missed a <. Cheers.

user557597 Over a year ago

OK, so (?<!&#)\d+(?=[^;\d]) runs into the same problem as @anubhava's alternation regex. You've split the condition into two parts that don't work separately. If only one of them holds, the regex misses a valid \d run. For example, it won't match 555; or &#77x.
