
I need to load some arbitrary HTML into an existing DOMDocument tree. Previous answers suggest using DOMDocumentFragment and its appendXML method to handle this.

As @Owlvark indicates in the comments, XML is not HTML, so this is not a good solution.

The main issue I had with it was that entities like &ndash; cause errors, because the appendXML method expects well-formed XML.

We could define the entities, but that does not address the underlying problem: not all HTML is valid XML.
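
The entity problem can be reproduced with a few lines. This is only a sketch: the sample markup and variable names are made up for illustration, and the exact error text depends on the libxml version.

```php
<?php
// Hypothetical input: fine as HTML, but &ndash; is not a predefined XML entity.
$doc = new DOMDocument();
$fragment = $doc->createDocumentFragment();

// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$ok = $fragment->appendXML('<p>2010 &ndash; 2012</p>');
$errors = libxml_get_errors();
libxml_clear_errors();

var_dump($ok); // appendXML() reports failure for this input
foreach ($errors as $error) {
    echo trim($error->message), PHP_EOL;
}
```

Collecting the errors this way keeps them out of the output, but the fragment still does not get the content, which is why suppression alone does not help here.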

What is a good solution for importing HTML into a DOMDocument tree?

  • You might just have to turn on libxml_use_internal_errors() and ignore it... Also, you're loading the document using DOMDocument::loadHTML(), right? Commented Sep 11, 2012 at 19:38
  • @FrankFarmer, libxml_use_internal_errors() only hides the errors from the output or from your error handler; it does nothing to actually resolve the issue. As for loadHTML, I am not using it. I am using DOMDocumentFragment::appendXML. Commented Sep 11, 2012 at 19:41
  • See this answer - HTML is not XML. Commented Sep 11, 2012 at 19:44
  • @Owlvark joy, that explains the error... but it also doesn't provide a viable solution. Commented Sep 11, 2012 at 19:48
  • You have been given two "solutions" (suppressing errors, defining entities). What makes them not "viable"? Commented Sep 11, 2012 at 20:09

1 Answer

The solution I came up with is to use DOMDocument::loadHTML, as @FrankFarmer suggests, and then import the parsed nodes into my current document. The input is wrapped in a div with a known id so that its top-level nodes can be found again after parsing. My implementation looks like this:

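/**
 * Parses an HTML fragment into nodes owned by an existing document.
 * @param string $html the raw HTML to transform
 * @param \DOMDocument $doc the document to import the nodes into
 * @return \DOMNode[] the imported nodes on success, or an empty array on failure
 */
protected function htmlToDOM($html, $doc) {
    // Wrap the fragment so its top-level nodes can be located after parsing.
    $html = '<div id="html-to-dom-input-wrapper">' . $html . '</div>';
    // loadHTML() is an instance method; calling it statically fails on PHP 8.
    $hdoc = new \DOMDocument();
    // Collect parser warnings about sloppy HTML instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $hdoc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $child_array = array();
    $wrapper = $loaded ? $hdoc->getElementById('html-to-dom-input-wrapper') : null;
    if ($wrapper === null) {
        error_log('htmlToDOM: could not parse the HTML fragment', 0);
        return $child_array;
    }
    foreach ($wrapper->childNodes as $child) {
        // importNode() makes a deep copy owned by $doc; it is not yet attached to the tree.
        $child_array[] = $doc->importNode($child, true);
    }
    return $child_array;
}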
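
The imported nodes still have to be attached to the target tree. A minimal sketch of that step follows. It assumes a standalone variant of the method above so that it runs on its own; the function name html_to_dom, the target document, and the sample markup are hypothetical.

```php
<?php
// Standalone variant of the method above (hypothetical name), so the sketch is runnable.
function html_to_dom(string $html, DOMDocument $doc): array {
    $hdoc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $hdoc->loadHTML('<div id="html-to-dom-input-wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = [];
    $wrapper = $hdoc->getElementById('html-to-dom-input-wrapper');
    if ($wrapper !== null) {
        foreach ($wrapper->childNodes as $child) {
            // Deep copy owned by $doc, not yet attached anywhere.
            $nodes[] = $doc->importNode($child, true);
        }
    }
    return $nodes;
}

// Hypothetical target document with an empty <section> to fill.
$doc = new DOMDocument();
$doc->loadXML('<article><section/></article>');
$target = $doc->getElementsByTagName('section')->item(0);

// The input uses an HTML entity and an unclosed <br>, both invalid as XML.
foreach (html_to_dom('<p>2010 &ndash; 2012</p><br>', $doc) as $node) {
    $target->appendChild($node);
}

echo $doc->saveXML($target), PHP_EOL;
```

Returning an array rather than appending inside the function leaves the caller free to choose where each node goes, for example appendChild on one element or insertBefore relative to another.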