PHP regexp parse HTML

Question

My regexp:

<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*<\/\1+>

My string:

<div>
<f>
</f>
</div>

The result is:

array(2
  0 => array(1
    0 => <f>
</f>
  )
  1 => array(1
    0 => f
  )
)

Why is it capturing <f></f> and ignoring the outer <div>?

HTML CANNOT be parsed with regexes except for very simple things. You are trying to parse a whole HTML fragment with a regex, which cannot be done unless you apply the regex recursively (new HTML fragments can be nested inside the tags, which CANNOT be handled by a single regex).
– Nikos M., Commented Dec 13, 2015 at 15:14
@NikosM.: that's false; PCRE (the regex engine used by PHP) has a recursion feature.
– Casimir et Hippolyte, Commented Dec 13, 2015 at 15:22
@CasimiretHippolyte, true, but that is not enough to parse HTML (except for simple things).
– Nikos M., Commented Dec 13, 2015 at 15:25
@Jan, one can indeed parse many things with regexes; it depends, of course, on what is meant by parsing. Some things cannot be parsed by regexes (think of nested HTML fragments where the tag attributes appear in a different order each time, to give a simple yet common example).
– Nikos M., Commented Dec 13, 2015 at 17:39
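
The PCRE recursion feature mentioned in the comments can be sketched roughly as follows. This is only an illustration, assuming tags without attributes, no self-closing or void elements, and no comments or CDATA; the variable names are made up for the example. It shows that recursion lets one pattern match the outer <div>, not that it makes regex a safe HTML parser.

```php
<?php
// (?R) re-enters the whole pattern, so a tag may contain text and/or nested tags.
// [^<]++ consumes plain text; \1 must equal the tag name opened at this nesting level.
$pattern = '~<([a-zA-Z0-9]+)>(?:[^<]++|(?R))*</\1>~';

// The string from the question.
$subject = "<div>\n<f>\n</f>\n</div>";

if (preg_match($pattern, $subject, $m)) {
    // $m[0] holds the full match, $m[1] the outermost tag name.
    echo "Matched tag: {$m[1]}\n";
    echo $m[0], "\n";
}
```

With the question's input, the match should cover the whole <div>...</div> block. As soon as the markup contains attributes, unclosed tags, or anything else outside the stated assumptions, the pattern would need to grow accordingly.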

Jan · Accepted Answer · 2015-12-13 19:38:09Z

2

The answer is USE A PARSER INSTEAD (sorry for shouting). While a regular expression is sometimes the faster way to pull out an ID or a URL string, interpreting HTML tags with a regex is error-prone. Consider the following code; isn't it much more readable than druidic characters with special meanings?

<?php
$str = "
<container>
    <div class='someclass' data='somedata'>
        <f>some content here</f>
    </div>
</container>";
$xml = simplexml_load_string($str);

echo $xml->div->f; // some content here
$attributes = $xml->div->attributes();
print_r($attributes); // class and data as keys
?>
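
The same idea can be applied to the exact string from the question. The following is a minimal sketch, assuming the fragment is well-formed XML (which the question's input is) and that the DOM extension is available; the variable names are illustrative only.

```php
<?php
// The string from the question, treated as a small XML document.
$str = "<div>\n<f>\n</f>\n</div>";

$dom = new DOMDocument();
$dom->loadXML($str);

// '*' selects every element in document order, including the root.
$names = [];
foreach ($dom->getElementsByTagName('*') as $element) {
    $names[] = $element->nodeName;
}

print_r($names); // lists "div" first, then "f"
```

Unlike the regex in the question, the parser reports both the outer <div> and the nested <f>, because it tracks nesting itself.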

edited Dec 13, 2015 at 19:38

answered Dec 13, 2015 at 16:03

Jan


1 Comment

Nikos M. Over a year ago

I would agree, although the user most probably wants a regex-based approach (even if suboptimal).

Oskari3000 · Accepted Answer · 2015-12-13 15:11:05Z

0

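I'd say it's because your second character class, [\na-zA-Z0-9]*, only matches zero or more newlines, letters, and digits before the closing tag. The content of the <div>...</div> block contains the nested <f> tag, and its "<", "/" and ">" characters are not in that class, so the pattern cannot match starting at <div>. The <f>...</f> block contains only a newline, so it matches.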
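
A small sketch that reproduces this behaviour with the pattern and string from the question (the ~ delimiters and variable names are choices made for this example):

```php
<?php
// The pattern from the question. In a single-quoted PHP string, \n and \1
// are passed through to PCRE unchanged.
$pattern = '~<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*</\1+>~';
$subject = "<div>\n<f>\n</f>\n</div>";

preg_match_all($pattern, $subject, $matches);

// Starting at <div>, the class consumes the newline, then hits "<" of <f>,
// which is neither in the class nor the start of </div>, so that attempt fails.
// Starting at <f>, the class consumes the newline and </f> follows, so it matches.
print_r($matches);
```

The only full match is the <f>...</f> block, with f as the captured tag name, which is the result shown in the question.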

answered Dec 13, 2015 at 15:11

Oskari3000
