PHP DOMDocument: Get inner HTML of node

Question

When loading HTML into a <textarea>, I intend to treat different kinds of links differently. Consider the following links:

<a href="http://stackoverflow.com">http://stackoverflow.com</a>
<a href="http://stackoverflow.com">StackOverflow</a>

When the text inside a link matches its href attribute, I want to remove the HTML and keep only the URL; otherwise, the link should remain unchanged.

Here's my code:

$body = "Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";

$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($dom->getElementsByTagName('a') as $node) {
    $link_text = $node->ownerDocument->saveHTML($node->childNodes[0]);
    $link_href = $node->getAttribute("href");
    $link_node = $dom->createTextNode($link_href);

    $node->parentNode->replaceChild($link_node, $node);
}

$html = $dom->saveHTML();

The problem with the above code is that DOMDocument wraps my HTML in a paragraph tag:

<p>Some HTML with a http://stackoverflow.com</p>

How do I get it to return only the inner HTML of that paragraph?

DOMDocument needs a root node to work, and it creates one if there is none. You could add a root node before parsing and remove it manually afterwards... Hope there is a better solution.
– Syscall, Commented Feb 22, 2018 at 14:29
It makes sense that there needs to be a root node. In that case, there might be no way around preg_replace('/(^<p>|<\/p>$)/', '', $html)
– idleberg, Commented Feb 22, 2018 at 14:48
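
For reference, the regex fallback mentioned in the comment above could look roughly like the sketch below. It assumes the wrapper is always a single bare <p> without attributes; the $html value is a hypothetical stand-in for the string returned by saveHTML(), which ends with a newline, hence the trim().

```php
<?php
// Hypothetical stand-in for the result of $dom->saveHTML().
$html = "<p>Some HTML with a http://stackoverflow.com</p>\n";

// Remove the trailing newline, then strip one leading <p> and one trailing </p>.
$inner = preg_replace('/(^<p>|<\/p>$)/', '', trim($html));

var_dump($inner);
```

This only holds as long as the wrapper is exactly <p>...</p>; if the input already brings its own root element, a pattern like this would need to change.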

Syscall · Accepted Answer · 2018-02-23 13:53:21Z

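A valid DOM document needs a root node.

I suggest wrapping the input in your own root <div>, so that you never strip a root element that was already part of the input.

Finally, read the result either from the nodeValue of the root node (which drops all remaining markup) or by cutting the wrapper off the saved HTML with substr().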
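
As a rough illustration of how those two options differ, here is a minimal sketch. The sample markup is made up for the example, and the offsets passed to substr() assume that saveHTML() returns exactly the <div> wrapper followed by a single newline.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<div>Keep <b>this</b> markup</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// With LIBXML_HTML_NOIMPLIED, the wrapper <div> is the document element.
$root = $dom->documentElement;

// nodeValue returns only the concatenated text, so the <b> tag is lost.
$text_only = $root->nodeValue;

// saveHTML() keeps the markup; substr() cuts off "<div>" (5 chars) and "</div>\n" (7 chars).
$with_markup = substr($dom->saveHTML(), 5, -7);

var_dump($text_only);
var_dump($with_markup);
```

So nodeValue is only suitable when no markup needs to survive, while the substr() variant preserves whatever HTML is left inside the wrapper.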

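$body = "Some HTML with a <a href=\"http://stackoverflow.com\">http://stackoverflow.com</a>";
// Wrap the input in a known root element so DOMDocument does not add its own <p>.
$body = '<div>' . $body . '</div>';

$dom = new DOMDocument;
$dom->loadHTML($body, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// getElementsByTagName() returns a live list; iterate over a snapshot,
// because replacing nodes while looping over the live list skips elements.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $node) {
    $link_text = $node->textContent;
    $link_href = $node->getAttribute("href");

    // Only unwrap links whose text equals their href; leave the others untouched.
    if ($link_text === $link_href) {
        $node->parentNode->replaceChild($dom->createTextNode($link_href), $node);
    }
}

$html = $dom->saveHTML();
$html = substr($html, 5, -7); // strip the leading "<div>" and the trailing "</div>\n"
var_dump($html); // "Some HTML with a http://stackoverflow.com"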

This also works if the input string is:

<p>Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a></p>

which outputs:

<p>Some HTML with a http://stackoverflow.com</p>
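
If you would rather not depend on string offsets, one possible alternative is to serialize the children of the wrapper node one by one. The sketch below assumes a helper named inner_html(), which is hypothetical and not part of DOMDocument; the input with two links is also made up to show that only the link whose text equals its href gets unwrapped.

```php
<?php
// Hypothetical helper (not a DOMDocument method): concatenates the
// serialized child nodes of an element, i.e. its inner HTML.
function inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$body = 'Some HTML with a <a href="http://stackoverflow.com">http://stackoverflow.com</a>'
      . ' and a <a href="http://stackoverflow.com">StackOverflow</a> link';

$dom = new DOMDocument;
$dom->loadHTML('<div>' . $body . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Snapshot the live node list before replacing nodes inside the loop.
foreach (iterator_to_array($dom->getElementsByTagName('a')) as $node) {
    $href = $node->getAttribute('href');
    if ($node->textContent === $href) {
        $node->parentNode->replaceChild($dom->createTextNode($href), $node);
    }
}

// The wrapper <div> is the document element; only its children are serialized.
$html = inner_html($dom->documentElement);
var_dump($html);
```

The wrapper <div> is still needed for parsing, but it never appears in the output, so no substr() or regex clean-up is required.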

edited Feb 23, 2018 at 13:53

answered Feb 22, 2018 at 14:55

Syscall


3 Comments

idleberg Over a year ago

I would have preferred a DOMDocument way to retrieve the child node. However, I need to preserve some HTML (including some links), and your first method strips all HTML.

Syscall Over a year ago

@idleberg I understand. I still suggest adding a root tag even if there already is one, because otherwise you could end up deleting an existing one.

Syscall Over a year ago

@idleberg I've updated the answer. Please also see the last part.
