My regexp:

<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*<\/\1+>

my string:

<div>
<f>
</f>
</div>

the result is:

array(2
  0 =>  array(1
  0 =>  <f>
</f>
)
1   =>  array(1
0   =>  f
)
)

Why is it capturing <f></f> and ignoring the outer <div>?

  • HTML CANNOT be parsed with regexes except for very simple things. You are trying to parse a whole HTML fragment with a regex, which cannot be done unless you apply the regex recursively (new HTML fragments can appear inside the tags, and a single regex CANNOT handle that). Commented Dec 13, 2015 at 15:14
  • Because < and > are not in your second character class. Commented Dec 13, 2015 at 15:20
  • @NikosM.: that's false; PCRE (the regex engine used by PHP) has a recursion feature. Commented Dec 13, 2015 at 15:22
  • @CasimiretHippolyte, true, but that is not enough to parse HTML (except for simple things). Commented Dec 13, 2015 at 15:25
  • @Jan, one can indeed parse many things with regexes; it depends, of course, on what is meant by parsing. Some things cannot be parsed by regexes (think of nested HTML fragments where the tag attributes appear in a different order each time, to give a simple yet common example). Commented Dec 13, 2015 at 17:39

2 Answers

The answer is: USE A PARSER INSTEAD (sorry for shouting). While a regular expression is sometimes the quicker way to pull out an ID or a URL string, interpreting HTML tags with a regex is error-prone. Consider the following code; isn't it much more beautiful than druidic characters with special meanings?

<?php
$str = "
<container>
    <div class='someclass' data='somedata'>
        <f>some content here</f>
    </div>
</container>";
$xml = simplexml_load_string($str);

echo $xml->div->f; // some content here
$attributes = $xml->div->attributes();
print_r($attributes); // class and data as keys
?>
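
Applied to the string from the question, the same approach might look like the sketch below. It assumes the fragment is well-formed XML (which the four-line example is); the variable names `$outer` and `$inner` are only illustrative.

```php
<?php
// The fragment from the question, written as a PHP string.
$str = "<div>\n<f>\n</f>\n</div>";
$xml = simplexml_load_string($str);

$outer = $xml->getName();          // name of the root element
$inner = [];
foreach ($xml->children() as $child) {
    $inner[] = $child->getName();  // names of the direct child elements
}

echo $outer, "\n";                 // div
echo implode(',', $inner), "\n";   // f
```

The parser reports the outer div first and the nested f as its child, which is the structure the regex in the question could not see.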

1 Comment

I would agree, although the user most probably wants a regex-based approach (even if suboptimal).

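I'd say it's because your second character class, [\na-zA-Z0-9]*, only matches newlines, letters and digits before the closing tag. The content of the <div>...</div> block contains <, > and /, none of which are in that class, so the attempt starting at <div> fails. The engine then moves on and finds <f>...</f>, whose content is just a newline.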
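
To see this concretely, here is a minimal sketch that runs the pattern from the question next to a recursive variant that uses the PCRE recursion feature mentioned in the comments. The recursive pattern is my own assumption, not something from the question: it only handles well-formed tags without attributes, so it is an illustration rather than an HTML parser. All variable names are illustrative.

```php
<?php
$html = "<div>\n<f>\n</f>\n</div>";

// Pattern from the question: the class between the tags has no '<', '>' or '/'.
$original = '~<([a-zA-Z0-9]+)>[\na-zA-Z0-9]*<\/\1+>~';
preg_match_all($original, $html, $flat);

// Recursive sketch: the body is either text without '<' or a nested element,
// matched by re-entering the whole pattern with (?R).
$recursive = '~<([a-zA-Z0-9]+)>(?:[^<]++|(?R))*</\1>~';
preg_match_all($recursive, $html, $nested);

echo $flat[1][0], "\n";    // f   (only the innermost element matches)
echo $nested[1][0], "\n";  // div (the outer element, including the nested <f>)
```

With the recursive variant the whole match is the complete <div>...</div> block. It still breaks as soon as attributes, self-closing tags or comments appear, which is the reason the other answer recommends a parser.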
