Pure PHP, extract HTML content from complex HTML string

Question

I have a complex HTML string similar to:

some text <blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote> some other text

Using PHP, I want to extract the entire content of the first blockquote, including any blockquotes nested inside it:

main text<blockquote>quotation</blockquote>end of main text

The difficult part is that I need to stop cutting the string at the correct closing tag: the one that matches the first opening tag. In this example it is the last closing tag, but this must be determined dynamically.

This is the attempt I have so far:

<?php

$some_html = "<blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote>";
$result =  get_first_element_of_HTML_tag_name($some_html,'blockquote');

function get_first_element_of_HTML_tag_name($html_string,$tag_name) {
    $h = strtolower($html_string);
    $tag_open = "<" . $tag_name . ">";
    $tag_close = "</" . $tag_name . ">";

    $element_start = strpos($h,$tag_open)+strlen($tag_open);
    $element_end = strpos($h,$tag_close);

    $element = substr($h,$element_start,$element_end); // cut to first closing tag
    $element_s = $element;
    $i = 2;
    while ( strpos($element_s,"<blockquote") !== false ) { // as long as substring contains another opening tag
        // include another closing tag in the result
        $element = substr($h,$element_start,nth_strpos($h,$element_end,$i));
        $element_s = substr( $element_s, strpos($element_s,$tag_open)+strlen($tag_open), nth_strpos($element_s,strpos($element_s,$tag_close),$i));
        $i++;
    } 
    return $hs; // return complete first element with $tag_name
}

function nth_strpos($str, $substr, $n) { 
    $ct = 0; 
    $pos = 0; 
    while ( ( $pos = strpos($str, $substr, $pos) ) !== false ) { 
        if (++$ct == $n) { 
            return $pos; 
        } 
        $pos++; 
    } 
    return false; 
}  

?>

$result comes back empty.

It's stuck somewhere in the nth_strpos function, I think.

Help or even simpler alternatives much appreciated!

Why aren't you using a DOM parser library? They've solved this problem for you.
– Barmar, Commented Oct 10, 2013 at 23:35
Agree, writing an HTML parser is no easy job, try with regex /s.
– elclanrs, Commented Oct 10, 2013 at 23:41
DOMDocument is the best approach. No headaches, just a slight learning curve. Don't reinvent the wheel unless your wheel is better than what's on offer.
– gwillie, Commented Oct 11, 2013 at 1:03

sgbj · Accepted Answer · 2013-10-11 00:00:50Z

1

As Barmar suggested, you should probably use a DOM parser. PHP 5 ships with a DOM API (DOMDocument) that makes this fairly easy. Here's an example:

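$str = "some text <blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote> some other text";

$doc = new DOMDocument();
$doc->loadHTML($str); // parse the string as an HTML document

// item(0) is the first <blockquote> in document order, here the outer one
$element = $doc->getElementsByTagName("blockquote")->item(0);

// Serialize each child node in turn to build the element's inner HTML
$innerHTML = '';
foreach ($element->childNodes as $child) {
    $innerHTML .= $doc->saveXML($child);
}
echo $innerHTML;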

Output:

main text<blockquote>quotation</blockquote>end of main text
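
If a DOM parser really is not an option, the matching closing tag can also be found by tracking nesting depth by hand, which is what the question's loop was trying to do. The following is only a rough sketch under stated assumptions: the function name first_element_inner_html is made up for this illustration, it assumes PHP 7.1 or later, and it assumes the markup is reasonably well formed (it does not handle comments, self-closing tags, or ">" characters inside attribute values). For anything beyond simple input, DOMDocument remains the safer choice.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the inner
// HTML of the first <$tagName> element, or null if no balanced element exists.
function first_element_inner_html(string $html, string $tagName): ?string
{
    // Match opening tags (with or without attributes) and closing tags.
    $pattern = '~<(/?)' . preg_quote($tagName, '~') . '\b[^>]*>~i';
    $flags = PREG_SET_ORDER | PREG_OFFSET_CAPTURE;
    if (!preg_match_all($pattern, $html, $matches, $flags)) {
        return null; // the tag does not occur at all
    }

    $depth = 0;    // how many elements of this tag are currently open
    $start = null; // offset just after the first opening tag
    foreach ($matches as $m) {
        [$tag, $offset] = $m[0];        // full tag text and its byte offset
        $isClosing = $m[1][0] === '/';
        if (!$isClosing) {
            if ($depth === 0) {
                $start = $offset + strlen($tag);
            }
            $depth++;
        } elseif ($depth > 0) {
            $depth--;
            if ($depth === 0) {
                // This closing tag belongs to the first opening tag.
                return substr($html, $start, $offset - $start);
            }
        }
    }
    return null; // the first element was never closed
}

$str = "some text <blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote> some other text";
echo first_element_inner_html($str, 'blockquote');
// prints: main text<blockquote>quotation</blockquote>end of main text
```

The depth counter goes up on every opening tag and down on every closing tag; the closing tag that brings it back to zero is the one paired with the first opening tag, no matter how many blockquotes are nested in between.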

answered Oct 11, 2013 at 0:00
