The expression /[^a-z0-9-]+/i will match (and subsequently replace with a space) any run of characters other than a-z, 0-9, and the hyphen. The ^ in [^...] negates the character set contained therein.
[^a-z0-9-] matches any character that is not alphanumeric or a hyphen
+ means one or more of the preceding
/i makes it match case-insensitively
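A minimal sketch of what this pattern does on its own. The sample strings are made up for illustration and are not from the original question:

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$sample = 'Hello, World! PHP-8 rocks...';

// Each run of characters other than a-z, 0-9 and "-" becomes ONE space,
// because + groups consecutive offending characters into a single match.
$result = preg_replace('/[^a-z0-9-]+/i', ' ', $sample);

// Without the /i flag, uppercase letters count as "not a-z" and are removed too.
$noFlag = preg_replace('/[^a-z0-9-]+/', ' ', 'Hello World');

var_dump($result); // "Hello World PHP-8 rocks " (note the trailing space)
var_dump($noFlag); // " ello orld"
```

Note that the hyphen in PHP-8 survives, since - is part of the negated set.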
The expression /\&#?[a-z0-9]{2,4}\;/ matches a & followed optionally by #, followed by two to four letters or digits, ending with a ; character. This would match HTML entities like &nbsp; or &#39;
&#? matches either & or &#, since ? makes the preceding # optional. The & doesn't actually need escaping.
[a-z0-9]{2,4} matches between two and four alphanumeric characters
; is the literal semicolon. It doesn't actually need escaping.
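To see which entities this pattern actually picks up, here is a small sketch using preg_match_all() on a made-up string (the sample text is an assumption for illustration). It uses the unescaped form of the pattern, which behaves the same way:

```php
<?php
// Hypothetical sample containing short and long, named and numeric entities.
$sample = 'Fish &amp; chips&nbsp;cost &#163;5 &hellip; &Eacute;';

// Collect every substring the entity pattern matches.
preg_match_all('/&#?[a-z0-9]{2,4};/', $sample, $matches);

print_r($matches[0]);
// Matched:     &amp;  &nbsp;  &#163;
// Not matched: &hellip; (six characters, more than {2,4} allows)
//              &Eacute; (too long, and uppercase E without the /i flag)
```

So the pattern covers short lowercase and numeric entities; whether that is enough depends on the input you expect.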
Partly as you suspected, the last one will replace any tags like <tagname> or <tagname attr='value'> or </tagname> with an empty space. Note that it matches the whole tag, not just the inner contents of <>.
< is the literal character
[^>]+ is every character up to but not including the next >
> is the literal character
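As a quick check that the whole tag is matched, attributes included, here is a sketch on a made-up HTML fragment (the sample is an assumption), with strip_tags() alongside for comparison:

```php
<?php
// Hypothetical HTML fragment, for illustration only.
$sample = "<p class='x'>Hi <b>there</b></p>";

// List everything the tag pattern matches: each match runs from < to the next >.
preg_match_all('/<[^>]+>/', $sample, $matches);
print_r($matches[0]); // <p class='x'>, <b>, </b>, </p>

// strip_tags() removes the same tags, but inserts nothing in their place.
$plain = strip_tags($sample);
var_dump($plain); // "Hi there"
```

One difference to keep in mind: preg_replace() with ' ' leaves a space where each tag was, while strip_tags() simply deletes the tag.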
I would really recommend rewriting this as three separate calls to preg_replace() rather than nesting them.
// Strips tags.
// Would be better done with strip_tags()!!
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
// Removes HTML entities
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
// Removes remaining non-alphanumerics (hyphens are kept)
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
$array = explode(' ', $texts);
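One thing to watch with the final explode(): if the text begins or ends with a tag or punctuation, the result contains empty strings. A possible variation, sketched here on a made-up input, is to split with preg_split() and the PREG_SPLIT_NO_EMPTY flag instead. This is a suggestion, not part of the original code:

```php
<?php
// Hypothetical input, for illustration only.
$texts = '<p>Fish &amp; chips, to-go!</p>';

// The same three steps as above.
$texts = preg_replace('/<[^>]+>/', ' ', $texts);
$texts = preg_replace('/&#?[a-z0-9]{2,4};/', ' ', $texts);
$texts = preg_replace('/[^a-z0-9-]+/i', ' ', $texts);
// $texts is now " Fish chips to-go " with a leading and trailing space.

// explode() keeps the empty pieces produced by those outer spaces.
$withEmpties = explode(' ', $texts);

// preg_split() with PREG_SPLIT_NO_EMPTY drops them.
$words = preg_split('/ +/', $texts, -1, PREG_SPLIT_NO_EMPTY);

print_r($withEmpties); // "", Fish, chips, to-go, ""
print_r($words);       // Fish, chips, to-go
```

Calling trim($texts) before explode() would give the same result here, since the third preg_replace() already collapses inner runs into single spaces.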