PHP regex and preg_replace issue

Question

I was looking through someone else's old code and having some trouble understanding it.

He has:

explode(' ', strtolower(preg_replace('/[^a-z0-9-]+/i', ' ', preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', preg_replace('/<[^>]+>/', ' ', $texts)))));

I think the first regex excludes a-z and 0-9, but I am not sure what the second regex does. The third one matches anything inside '< >' except '>'.

The result is an array with every word in the $texts variable; however, I don't understand how the code produces this. I do understand what preg_replace and the other functions do, I just don't know how the whole process works.

That many nested preg_replace calls is just going to lead to confusion
– Scuzzy, Commented Mar 19, 2013 at 23:30
Break it up into three separate statements, using temporary variables. Then it gets easier to follow.
– mario, Commented Mar 19, 2013 at 23:31

Michael Berkowski · Accepted Answer · 2013-03-19 23:44:23Z

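The expression /[^a-z0-9-]+/i will match (and subsequently replace with a single space) any run of characters other than letters, digits, and the hyphen. The ^ in [^...] means to negate the character set contained therein.

[^a-z0-9-] matches any character that is not a letter, a digit, or a hyphen (the trailing - inside the brackets is a literal hyphen)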
+ means one or more of the preceding
/i makes it match case-insensitively
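
To see what that character class keeps and what it discards, here is a small sketch. The sample string is made up for illustration; note that the hyphen in "well-known" survives while the apostrophe does not.

```php
<?php
// Hypothetical sample input, chosen only to illustrate the character class.
$sample = "Hello, World! It's well-known: 42";

// Each run of characters that is not a letter, digit, or hyphen
// is replaced by one space.
$result = preg_replace('/[^a-z0-9-]+/i', ' ', $sample);

echo $result, PHP_EOL; // Hello World It s well-known 42
```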

The expression /\&#?[a-z0-9]{2,4}\;/ matches a & optionally followed by #, followed by two to four lowercase letters or digits, and ending with a ;. This would match HTML entities like &nbsp; or &#39;

&#? matches either & or &#, since ? makes the preceding # optional. The & doesn't actually need escaping.
[a-z0-9]{2,4} matches between two and four alphanumeric characters
; is the literal semicolon. It doesn't actually need escaping.
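
A quick way to check which entities this pattern catches is to run it against a few candidates. The sample entities below are my own picks, not from the original code. Because the pattern has no /i flag and allows at most four characters, longer or uppercase entity names slip through.

```php
<?php
$pattern = '/&#?[a-z0-9]{2,4};/';

// Illustrative inputs (assumptions): short, long, and uppercase entity names.
$samples = ['&nbsp;', '&#39;', '&amp;', '&hellip;', '&AMP;'];

$matched = [];
foreach ($samples as $s) {
    // preg_match() returns 1 when the pattern is found, 0 otherwise.
    $matched[$s] = preg_match($pattern, $s) === 1;
}

var_export($matched);
```

Here &nbsp;, &#39; and &amp; match, while &hellip; (six letters) and &AMP; (uppercase) do not.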

Partly as you suspected, the last one will replace any tag like <tagname> or <tagname attr='value'> or </tagname> with a single space. Note that it matches the whole tag, including the < and >, not just the inner contents.

< is the literal character
[^>]+ is every character up to but not including the next >
> is the literal character
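
As a rough illustration of the tag pattern, with made-up input: every tag becomes one space, so adjacent tags leave several spaces behind. The second case shows one limitation, namely that a > inside an attribute value ends the match early.

```php
<?php
// Hypothetical markup used only for demonstration.
$html = '<p class="intro">Hello <b>world</b></p>';
$result = preg_replace('/<[^>]+>/', ' ', $html);

// Each of the four tags is replaced by its own space.
var_dump($result); // string(15) " Hello  world  "

// A ">" inside an attribute value cuts the match short,
// leaving part of the tag in the output.
$broken = preg_replace('/<[^>]+>/', ' ', '<a title="a>b">x</a>');
var_dump($broken);
```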

I would really recommend rewriting this as three separate calls to preg_replace() rather than nesting them.

// Strips tags.
// Would be better done with strip_tags()!!
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Removes HTML entities
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
// Removes remaining characters that are not letters, digits, or hyphens
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// Lowercases (as the original one-liner did) and splits on spaces
$array = explode(' ', strtolower($texts));
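
One thing to watch for: explode(' ', ...) returns empty strings wherever the text starts or ends with a space. Here is a minimal sketch of the same steps wrapped in a function that drops those empties. The name extractWords and the sample input are hypothetical, and swapping explode() for preg_split() with PREG_SPLIT_NO_EMPTY is my substitution, not part of the original code.

```php
<?php
// Hypothetical helper: same three replacements, then a split that
// discards empty pieces.
function extractWords(string $texts): array
{
    $texts = preg_replace('/<[^>]+>/', ' ', $texts);           // tags
    $texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts); // entities
    $texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);      // everything else
    return preg_split('/ +/', strtolower($texts), -1, PREG_SPLIT_NO_EMPTY);
}

$words = extractWords('<p>Hello&nbsp;<b>World</b>, it is well-known!</p>');
print_r($words); // hello, world, it, is, well-known
```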

edited Mar 19, 2013 at 23:44

answered Mar 19, 2013 at 23:30

Michael Berkowski

Jan Turoň Over a year ago

...matches a & optionally followed by #?

Scuzzy · Accepted Answer · 2013-03-19 23:39:53Z

This code looks like it...

strips HTML/XML tags (anything between < and >)
then strips anything that starts with & or &#, followed by 2-4 lowercase alphanumeric characters and a closing ;
then strips anything that is not an alphanumeric or a dash

In order of processing (innermost call first):

/<[^>]+>/

Match the character “<” literally «<»
Match any character that is NOT a “>” «[^>]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “>” literally «>»


/\&#?[a-z0-9]{2,4}\;/

Match the character “&” literally «\&»
Match the character “#” literally «#?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character present in the list below «[a-z0-9]{2,4}»
   Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
Match the character “;” literally «\;»


/[^a-z0-9-]+/i

Options: case insensitive

Match a single character NOT present in the list below «[^a-z0-9-]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
   The character “-” «-»
