0

I want to find all <h3> blocks in this example:

<h3>sdf</h3>
sdfsdf
<h3>sdf</h3>
32
<h2>fs</h2>
<h3>23sd</h3>
234
<h1>h1</h1>

(Each block runs from an h3 to the next h3 or h2.) This regexp finds only the first h3 block:

~\<h3[^>]*\>[^>]+\<\/h3\>.+(?:\<h3|\<h2|\<h1)~is

I use the PHP function preg_match_all. (Quote from the docs: "After the first match is found, the subsequent searches are continued on from end of the last match.")

What do I have to modify in my regexp?

P.S.

<h3>1</h3>
1content
<h3>2</h3>
2content
<h2>h2</h2>
<h3>3</h3>
3content
<h1>h1</h1>

This content has to be parsed as:

[0] => <h3>1</h3>1content
[1] => <h3>2</h3>2content
[2] => <h3>3</h3>3content
  • Don't use regexes for parsing HTML. Commented Apr 4, 2014 at 0:45
  • Not sure I really understand your issue. Commented Apr 4, 2014 at 0:46
  • Thanks for your answer, but I parse my own page with a defined structure. Commented Apr 4, 2014 at 0:47
  • Please take a look at the DomDocument class. If you parse your HTML, you can easily query all the heading three blocks. Commented Apr 4, 2014 at 0:58
  • Questions about parsing HTML with PHP/regex come up so often on SO. Let me echo what has already been said: don't do that. There are many far more able and useful tools for this problem. Look at the PHP internal classes DOMDocument and DOMXPath for starters. Make life easier for yourself :) Commented Apr 4, 2014 at 1:09

3 Answers

1

With DOMDocument:

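$dom = new DOMDocument();
// Suppress the warnings libxml emits for markup it considers malformed.
@$dom->loadHTML($html);

// loadHTML() wraps a fragment in <html><body>, so walk the children of <body>.
$nodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

$flag = false;       // true while we are inside an h3 block
$tmp = '';           // markup collected for the current block
$results = array();

foreach ($nodes as $node) {
    // Any h1, h2 or h3 element ends the current block; group 1 is set only for h3.
    if ($node->nodeType == XML_ELEMENT_NODE &&
        preg_match('~^h(?:[12]|(3))$~i', $node->nodeName, $m)) {
        if ($flag) {
            $results[] = $tmp;
        }
        if (isset($m[1])) {
            // An h3 starts a new block.
            $tmp = $dom->saveXML($node);
            $flag = true;
        } else {
            $flag = false;
        }
    } elseif ($flag) {
        // Everything between two headings belongs to the current h3 block.
        $tmp .= $dom->saveXML($node);
    }
}

// Store the last block when the document does not end with an h1 or h2.
if ($flag) {
    $results[] = $tmp;
}

echo htmlspecialchars(print_r($results, true));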

With regex:

preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);

echo htmlspecialchars(print_r($matches[0], true));
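
The difference from the pattern in the question is that the lookahead only checks for the next heading and does not consume it, so the following search can start right at that heading. Here is a minimal sketch of that behaviour on the sample from the question. The `trim` call and the second pattern (with `\b` and `\z`, so that a block may also end at the end of the input) are my own additions for illustration, not part of the pattern above.

```php
<?php
// Sample input copied from the question.
$html = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>\n";

// Lazy match from each <h3 up to, but not including, the next h1/h2/h3 tag.
preg_match_all('~<h3.*?(?=<h[123])~si', $html, $matches);

// Strip the trailing newline that each block keeps before the next heading.
$blocks = array_map('trim', $matches[0]);
print_r($blocks);

// Assumed variant: \z lets the last block end at the end of the input,
// and \b keeps the pattern from matching longer tag names that start with h1, h2 or h3.
$tail = "<h2>x</h2>\n<h3>last</h3>\nno heading after this";
preg_match_all('~<h3\b.*?(?=<h[123]\b|\z)~si', $tail, $tailMatches);
print_r($tailMatches[0]);
```

If the last h3 block in your page is always followed by another heading, the first pattern is enough.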

1

You shouldn't use Regex to parse HTML if there is any nesting involved.

Regex

(<(h\d)>.*?<\/\2>)[\r\n]([^\r\n<]+)

Replacement

\1\3
or
$1$3

http://regex101.com/r/uQ3uC2
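
In PHP this could be applied with preg_replace. The following is only a sketch: the `~` delimiters, the `$joined` variable, and the assumption that lines end with a single `\n` are mine. Note that it joins every heading with the text line that follows it and returns one string, so you would still need to pick the h3 lines out of the result.

```php
<?php
// Sample input copied from the question.
$html = "<h3>1</h3>\n1content\n<h3>2</h3>\n2content\n<h2>h2</h2>\n<h3>3</h3>\n3content\n<h1>h1</h1>\n";

// Group 1 is the whole heading element, group 3 is the text line after it.
// The single line break between them is dropped by the replacement.
$joined = preg_replace('~(<(h\d)>.*?</\2>)[\r\n]([^\r\n<]+)~', '$1$3', $html);

echo $joined;
```

Headings that are followed directly by another tag, such as the h2 in the sample, are left unchanged.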

0
preg_match_all('/<h3>(.*?)<\/h3>/is', $stringHTML, $matches);
