
I have a complex HTML string similar to:

some text <blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote> some other text

Using PHP I want to extract the entire content of the first blockquote, even if that includes other blockquotes:

main text<blockquote>quotation</blockquote>end of main text

The difficult part is that I need to stop cutting the string at the correct closing tag: the one that matches the first opening tag. In this example it is the last closing tag, but in general it must be determined dynamically.

This is the attempt I have so far:

<?php

$some_html = "<blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote>";
$result =  get_first_element_of_HTML_tag_name($some_html,'blockquote');

function get_first_element_of_HTML_tag_name($html_string,$tag_name) {
    $h = strtolower($html_string);
    $tag_open = "<" . $tag_name . ">";
    $tag_close = "</" . $tag_name . ">";

    $element_start = strpos($h,$tag_open)+strlen($tag_open);
    $element_end = strpos($h,$tag_close);

    $element = substr($h,$element_start,$element_end); // cut to first closing tag
    $element_s = $element;
    $i = 2;
    while ( strpos($element_s,"<blockquote") !== false ) { // as long as substring contains another opening tag
        // include another closing tag in the result
        $element = substr($h,$element_start,nth_strpos($h,$element_end,$i));
        $element_s = substr( $element_s, strpos($element_s,$tag_open)+strlen($tag_open), nth_strpos($element_s,strpos($element_s,$tag_close),$i));
        $i++;
    } 
    return $hs; // return complete first element with $tag_name
}

function nth_strpos($str, $substr, $n) { 
    $ct = 0; 
    $pos = 0; 
    while ( ( $pos = strpos($str, $substr, $pos) ) !== false ) { 
        if (++$ct == $n) { 
            return $pos; 
        } 
        $pos++; 
    } 
    return false; 
}  

php?>

$result is returning blank...

It's stuck somewhere in the nth_strpos function, I think.

Help or even simpler alternatives much appreciated!

  • Why aren't you using a DOM parser library? They've solved this problem for you. Commented Oct 10, 2013 at 23:35
  • Agree, writing an HTML parser is no easy job, try with regex /s. Commented Oct 10, 2013 at 23:41
  • @elclanrs No! Don't try with regex. Use a DOM parser. Commented Oct 11, 2013 at 0:20
  • DOMDocument is the best approach. no headaches, just a slight learning curve. don't reinvent the wheel unless your wheel is better than what's on offer Commented Oct 11, 2013 at 1:03

1 Answer

As Barmar suggested, you should probably use a DOM parser. PHP 5 ships with a DOM API (DOMDocument) that lets you do this fairly easily, because the parser matches each closing tag to its opening tag for you. Here's an example:

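$str = "some text <blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote> some other text";

$doc = new DOMDocument();
$doc->loadHTML($str); // parse the string into a DOM tree

// item(0) is the first <blockquote> in document order, i.e. the outermost one here
$element = $doc->getElementsByTagName("blockquote")->item(0);

// Serialize each child node to rebuild the element's inner HTML
$innerHTML = '';
foreach ($element->childNodes as $child) {
    $innerHTML .= $doc->saveXML($child);
}
echo $innerHTML;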

Output:

main text<blockquote>quotation</blockquote>end of main text
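
If you need this in more than one place, the same approach could be wrapped in a function shaped like the one in the question. This is only a sketch: the name first_element_inner_html is hypothetical, and the null return for a missing tag and the use of libxml_use_internal_errors() to keep parser warnings quiet on malformed input are assumptions about what you want, not part of the example above.

```php
<?php
/**
 * Hypothetical helper (name chosen for illustration): returns the inner
 * markup of the first <$tagName> element in $html, or null if none exists.
 */
function first_element_inner_html($html, $tagName)
{
    $doc = new DOMDocument();

    // Assumption: input may be malformed, so collect libxml warnings
    // internally instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    // First match in document order; nested elements come after their parent.
    $element = $doc->getElementsByTagName($tagName)->item(0);
    if ($element === null) {
        return null;
    }

    // Concatenate the serialized children, as in the example above.
    $inner = '';
    foreach ($element->childNodes as $child) {
        $inner .= $doc->saveXML($child);
    }
    return $inner;
}

$html = "some text <blockquote>main text<blockquote>quotation</blockquote>end of main text</blockquote> some other text";
echo first_element_inner_html($html, 'blockquote'), "\n";
```

For the sample string this should print the same line as the output above. Note that saveXML() serializes in XML style, so void elements such as <br> come out as <br/>; on PHP 5.3.6 or later, $doc->saveHTML($child) is an alternative if you want HTML-style output.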