
I was looking through someone else's old code and having some trouble understanding it.

He has:

explode(' ', strtolower(preg_replace('/[^a-z0-9-]+/i', ' ', preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', preg_replace('/<[^>]+>/', ' ', $texts)))));

I think the first regex excludes a-z and 0-9, but I am not sure what the second regex does. The third one matches anything inside the '< >' except '>'.

The result is an array with every word in the $texts variable; however, I don't know how the code produces this. I do understand what preg_replace and the other functions do, I just don't know how the whole process works.

  • That many nested preg_replace calls is just going to lead to confusion. Commented Mar 19, 2013 at 23:30
  • Break it up into three separate statements, using temporary variables. Then it gets easier to follow. Commented Mar 19, 2013 at 23:31

2 Answers


The expression /[^a-z0-9-]+/i matches any run of characters other than a-z, 0-9, and the hyphen; preg_replace then replaces each run with a single space. The ^ in [^...] negates the character set contained therein.

  • [^a-z0-9-] matches any character that is not a letter, a digit, or a hyphen (the trailing - is a literal hyphen)
  • + means one or more of the preceding
  • /i makes it match case-insensitively
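To see this pattern in isolation, here is a small sketch. The sample string is made up for illustration; only the pattern comes from the question.

```php
<?php
// Hypothetical sample input, chosen only to show the behaviour.
$sample = "Hello, World! It's a well-known fact: 2+2=4.";

// Every run of characters outside a-z, 0-9 and "-" collapses to ONE space.
// The /i flag keeps uppercase letters from being treated as "other".
$cleaned = preg_replace('/[^a-z0-9-]+/i', ' ', $sample);

echo $cleaned, PHP_EOL;
```

Note that the hyphen in "well-known" survives, while the apostrophe splits "It's" into two pieces and the final period leaves a trailing space.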

The expression /\&#?[a-z0-9]{2,4}\;/ matches a & optionally followed by #, then two to four lowercase letters or digits, ending with a ;. This matches HTML entities like &nbsp; or &#39;. Because there is no /i flag and the length is capped at four, entities such as &Eacute; or &hellip; are not matched.

  • &#? matches either & or &#, since ? makes the preceding # optional. The & doesn't actually need escaping.
  • [a-z0-9]{2,4} matches between two and four lowercase letters or digits
  • ; is the literal semicolon. It doesn't actually need escaping.
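A quick way to check which entities this pattern catches is to run it over a few candidates. The list below is only an illustrative selection, not something from the original code.

```php
<?php
// Same pattern as in the question, without the unnecessary escapes.
$pattern = '/&#?[a-z0-9]{2,4};/';

// Illustrative entities: named, decimal, hex, long, and capitalised.
$samples = ['&nbsp;', '&#39;', '&amp;', '&#8217;', '&#x27;', '&hellip;', '&Eacute;', '&#x2019;'];

foreach ($samples as $entity) {
    $matched = preg_match($pattern, $entity) === 1;
    printf("%-10s %s\n", $entity, $matched ? 'matched' : 'not matched');
}
```

The first five are matched. The last three are not: &hellip; and &#x2019; have more than four characters before the semicolon, and &Eacute; starts with an uppercase letter.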

Partly as you suspected, the last one (the innermost call, so it actually runs first) replaces any tag like <tagname>, <tagname attr='value'>, or </tagname> with a single space. Note that it matches the whole tag, not just the inner contents of <>.

  • < is the literal character
  • [^>]+ is every character up to but not including the next >
  • > is the literal character
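As a rough illustration of the tag pattern, and of how it differs from strip_tags(), consider this sketch. The sample markup is an assumption made up for the example.

```php
<?php
// Hypothetical markup in which two words are separated only by tags.
$html = '<li>one</li><li>two</li>';

// The regex turns every tag into a space, so the words stay separated.
$viaRegex = preg_replace('/<[^>]+>/', ' ', $html);

// strip_tags() removes the tags without inserting anything in their place.
$viaStripTags = strip_tags($html);

var_dump($viaRegex);      // words separated by spaces
var_dump($viaStripTags);  // words run together
```

So if the goal is to split the text into words afterwards, replacing tags with a space (as the original code does) keeps adjacent elements from fusing into one word.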

I would really recommend rewriting this as three separate calls to preg_replace() rather than nesting them.

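// Strips tags (each tag becomes a space).
// Could also be done with strip_tags(), though that inserts no space where a tag was.
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Removes HTML entities
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
// Replaces remaining runs of characters other than letters, digits and hyphens
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// Lowercases and splits on spaces, as the original one-liner does
$array = explode(' ', strtolower($texts));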
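One thing to be aware of: explode(' ', ...) returns empty strings wherever the replacements leave a leading, trailing, or doubled space. If that is not wanted, a possible variation is to split with preg_split() and the PREG_SPLIT_NO_EMPTY flag. This is only a sketch; the function name extract_words is made up here and is not part of the original code.

```php
<?php
// Hypothetical helper; the three patterns are the ones from the question.
function extract_words(string $texts): array
{
    $texts = preg_replace('/<[^>]+>/', ' ', $texts);            // tags -> space
    $texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);  // entities -> space
    $texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);       // everything else -> space

    // Split on runs of spaces and drop empty pieces instead of keeping them.
    return preg_split('/ +/', strtolower($texts), -1, PREG_SPLIT_NO_EMPTY);
}

print_r(extract_words('<p>Hello&nbsp;World!</p>'));
```

With plain explode() the same input would also yield an empty string at each end of the array, because the surrounding <p> tags turn into spaces.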

1 Comment

...matches a & optionally followed by #?

This code looks like it...

  1. strips HTML/XML tags (anything between < and >)
  2. then strips HTML entities: & or &#, followed by 2-4 lowercase alphanumeric characters and a ;
  3. then strips anything that is not an alphanumeric or a dash
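The three steps can be traced on a sample string by keeping each intermediate result in its own variable. The input below is invented for illustration; the patterns and the final strtolower/explode are the ones from the question.

```php
<?php
// Hypothetical input containing a tag pair, two entities and a hyphenated word.
$texts = "<p>Tom &amp; Jerry&#39;s well-known show</p>";

// Step 1: every tag becomes a space.
$step1 = preg_replace('/<[^>]+>/', ' ', $texts);

// Step 2: every short lowercase entity becomes a space.
$step2 = preg_replace('/\&#?[a-z0-9]{2,4}\;/', ' ', $step1);

// Step 3: every run of other characters collapses to one space.
$step3 = preg_replace('/[^a-z0-9-]+/i', ' ', $step2);

// Finally: lowercase and split on single spaces.
$words = explode(' ', strtolower($step3));

var_dump($step1, $step2, $step3);
print_r($words);
```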

In processing order (innermost call first):

/<[^>]+>/

Match the character “<” literally «<»
Match any character that is NOT a “>” «[^>]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the character “>” literally «>»


/\&#?[a-z0-9]{2,4}\;/

Match the character “&” literally «\&»
Match the character “#” literally «#?»
   Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character present in the list below «[a-z0-9]{2,4}»
   Between 2 and 4 times, as many times as possible, giving back as needed (greedy) «{2,4}»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
Match the character “;” literally «\;»


/[^a-z0-9-]+/i

Options: case insensitive

Match a single character NOT present in the list below «[^a-z0-9-]+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   A character in the range between “a” and “z” «a-z»
   A character in the range between “0” and “9” «0-9»
   The character “-” «-»
