How to delete HTML text between html tags in PHP using SimpleHtmlDom

Question

Using SimpleHtmlDom (http://simplehtmldom.sourceforge.net/), I know that this extracts the text of an HTML page:

<?php
include('simple_html_dom.php');
// Create DOM from URL
echo file_get_html('http://www.google.com/')->plaintext; 

?>

But how can I delete all of the text instead?

For example, if I have this input HTML:

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Lore Ipsum</h1>
        <p>
            Lorem ipsum dolor sit amet, consectetuer adipiscing elit.<br/>
            Aenean <em>commodo</em> ligula eget dolor. Aenean massa.
        </p>
    </body>
</html>

I would like to get this output with SimpleHtmlDom:

<html>
    <head>
        <title></title>
    </head>
    <body>
        <h1></h1>
        <p><br/></p>
    </body>
</html>

In other words, I want to keep the structure of the document only.

Please help.

Please clarify the question. It's unclear what you mean by "HTML Text" and whether "html tags" refers to the actual <html> root node or means any html element.
– Gordon, Commented Jan 21, 2011 at 9:26

Anthony Pegram · Accepted Answer · 2011-12-25 05:19:23Z

3

I don't know for sure how to do that with SimpleHtmlDom. From its manual, I'd assume something like:

$html = file_get_html('http://www.google.com/');
// find('text') selects the text nodes; innertext is the writable property.
foreach ($html->find('text') as $text) {
    $text->innertext = '';
}
echo $html;

However, you can also use PHP's native DOM parser. It can do XPath queries and should in general be a good deal faster:

libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.google.com');
$xp = new DOMXPath($dom);
foreach ($xp->query('//text()') as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}
$dom->formatOutput = TRUE;
echo $dom->saveXML($dom->documentElement);
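
To try the native DOM approach on markup like the sample in the question without fetching a URL, here is a minimal sketch that loads the HTML from a string. The helper name `strip_text_nodes` and the shortened sample string are assumptions made for illustration, not part of the original answer. Note that this keeps emptied elements such as `<em></em>`, and that whitespace-only text nodes are removed as well, so the original indentation is lost.

```php
<?php
// Hypothetical helper: returns the markup of $html with every text node removed.
function strip_text_nodes(string $html): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Copy the result into an array first so removals cannot disturb iteration.
    foreach (iterator_to_array($xpath->query('//text()')) as $textNode) {
        $textNode->parentNode->removeChild($textNode);
    }

    // Serialize the <html> element and everything below it.
    return $dom->saveHTML($dom->documentElement);
}

// Shortened version of the question's sample document (assumed input).
$sample = '<html><head><title>Example</title></head><body>'
    . '<h1>Lore Ipsum</h1>'
    . '<p>Lorem ipsum dolor sit amet.<br/>Aenean <em>commodo</em> ligula.</p>'
    . '</body></html>';

echo strip_text_nodes($sample), "\n";
```

If the emptied `<em>` elements should disappear too, as in the desired output shown in the question, they would have to be removed in a separate pass.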

edited Dec 25, 2011 at 5:19

Anthony Pegram

answered Jan 21, 2011 at 8:33

Gordon

Geoffrey · Accepted Answer · 2011-01-21 08:54:51Z

1

Set `innertext` Property of HTML Element to the Empty String

Using simplehtmldom.php:

$my_html = file_get_html('http://www.google.com/'); 
$my_html->innertext = "";

answered Jan 21, 2011 at 8:54

Geoffrey

4 Comments

Gordon Over a year ago

I'm not that familiar with SimpleHtmlDom, but that would remove the entire innerHTML of the root element, effectively making the page empty, wouldn't it?

Geoffrey Over a year ago

@Gordon, the question was "But how to delete all the text?". If the asker only wants to delete some text, he/she will have to drill down to the relevant element.

Gordon Over a year ago

But Elements are not Text. In a proper DOM implementation, these are different things. I've asked the OP to clarify his terminology. It's too vague and ambiguous right now.

woninana Over a year ago

@Gordon: I have tried this. As you said, it effectively makes the page empty, but the plain text in the HTML file is still there.
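
The distinction between elements and text discussed in these comments can be seen with PHP's native DOM. The following is a small sketch; the helper name `child_node_kinds` and the input fragment are assumptions chosen for illustration. It lists the kinds of child nodes inside a `<p>` and shows that the text around `<em>` lives in separate text nodes rather than in the element itself.

```php
<?php
// Hypothetical helper: describes each direct child of the first <$tag> element.
function child_node_kinds(string $html, string $tag): array
{
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $element = $dom->getElementsByTagName($tag)->item(0);

    $kinds = [];
    foreach ($element->childNodes as $child) {
        if ($child instanceof DOMText) {
            $kinds[] = 'text';
        } elseif ($child instanceof DOMElement) {
            $kinds[] = 'element:' . $child->nodeName;
        } else {
            $kinds[] = 'other';
        }
    }
    return $kinds;
}

echo implode(', ', child_node_kinds('<p>Aenean <em>commodo</em> ligula</p>', 'p')), "\n";
// prints: text, element:em, text
```

Emptying an element's inner content removes its child elements along with the text, whereas removing only the text nodes leaves the element structure in place.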
