
Using http://simplehtmldom.sourceforge.net/, I know this extracts the plain text of an HTML page:

<?php
include('simple_html_dom.php');
// Create DOM from URL
echo file_get_html('http://www.google.com/')->plaintext; 

?>

But how do I delete all the text and keep only the tags?

For example, if I have this input HTML:

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>Lore Ipsum</h1>
        <p>
            Lorem ipsum dolor sit amet, consectetuer adipiscing elit.<br/>
            Aenean <em>commodo</em> ligula eget dolor. Aenean massa.
        </p>
    </body>
</html>

I would like to get this output with SimpleHtmlDom:

<html>
    <head>
        <title></title>
    </head>
    <body>
        <h1></h1>
        <p><br/><em></em></p>
    </body>
</html>

In other words, I want to keep the structure of the document only.

Please help.

  • (related) Best Methods to parse HTML Commented Jan 21, 2011 at 8:34
  • Please clarify the question. It's unclear what you mean by "HTML Text" and whether "html tags" refers to the actual <html> root node or means any html element. Commented Jan 21, 2011 at 9:26
  • I'm referring to the plain text of the HTML file. Commented Jan 22, 2011 at 17:52

2 Answers


I don't know for sure how to do that with SimpleHtmlDom. From its manual, I'd assume something like

$html = file_get_html('http://www.google.com/');
foreach( $html->find('text') as $text) {
    $text->plaintext = '';
}

However, you can also use PHP's native DOM parser. It can do XPath queries and should in general be a good deal faster:

libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.google.com');
$xp = new DOMXPath($dom);
foreach ($xp->query('//text()') as $textNode) {
    $textNode->parentNode->removeChild($textNode);
}
$dom->formatOutput = TRUE;
echo $dom->saveXML($dom->documentElement);
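
To try the DOM approach without a network request, here is a minimal sketch that runs the same XPath removal on an inline copy of the question's sample markup. The helper name stripTextNodes and the shortened sample string are assumptions made for illustration. It also swaps saveXML() for saveHTML(), because saveXML() typically writes empty elements in self-closing form, while the desired output in the question uses explicit closing tags.

```php
<?php
// Hypothetical helper (not part of the original answer): removes every
// text node from an HTML string and returns the remaining markup.
function stripTextNodes(string $html): string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The XPath result is a static list, so removing nodes while
    // iterating over it does not skip any of them.
    foreach ($xpath->query('//text()') as $textNode) {
        $textNode->parentNode->removeChild($textNode);
    }

    // Serialize as HTML so empty elements keep their closing tags.
    return $dom->saveHTML($dom->documentElement);
}

// Shortened version of the sample document from the question.
$input = '<html><head><title>Example</title></head><body>'
       . '<h1>Lore Ipsum</h1>'
       . '<p>Lorem ipsum dolor sit amet.<br/>Aenean <em>commodo</em> ligula.</p>'
       . '</body></html>';

echo stripTextNodes($input), PHP_EOL;
```

Note that the em element survives as an empty element, since only text nodes are removed. If inline elements should disappear as well, they would have to be removed in a separate step.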


Set innertext Property of HTML Element to the Empty String

Using simplehtmldom.php:

$my_html = file_get_html('http://www.google.com/'); 
$my_html->innertext = "";

4 Comments

I'm not that familiar with SimpleHtmlDom, but that would remove the entire innerHTML of the root element, effectively making the page empty, wouldn't it?
@Gordon, the question was "But how to delete all the text?". If the asker only wants to delete some text, he/she will have to drill down to the relevant element.
But Elements are not Text. In a proper DOM implementation, these are different things. I've asked the OP to clarify his terminology. It's too vague and ambiguous right now.
@Gordon: I tried this. As you said, it effectively makes the page empty, but the plain text in the HTML file is still there.
