I want to remove empty paragraphs from an HTML document using simple_html_dom.php. I know how to do it with the DOMDocument class, but the HTML files I work with are prepared in MS Word, and for those files DOMDocument's loadHTMLFile() function fails with the error "Namespaces are not defined".

This is the code I use with the DOMDocument object for HTML files not prepared in MS Word:

<?php
/* Using the DOMDocument class */

/* Create a new DOMDocument object. */
$html = new DOMDocument("1.0", "UTF-8");

/* Load HTML code from an HTML file into the DOMDocument. */
$html->loadHTMLFile("HTML File With Empty Paragraphs.html");

/* Assign all the <p> elements into the $pars DOMNodeList object. */
$pars = $html->getElementsByTagName("p");

echo "The initial number of paragraphs is " . $pars->length . ".<br />";

/* The trim() function is used to remove leading and trailing spaces as well as
* newline characters. */
for ($i = 0; $i < $pars->length; $i++){
    if (trim($pars->item($i)->textContent) == ""){
        $pars->item($i)->parentNode->removeChild($pars->item($i));
        $i--;
    }
}

echo "The final number of paragraphs is " . $pars->length . ".<br />";

// Write the HTML code back into an HTML file.
$html->saveHTMLFile("HTML File WithOut Empty Paragraphs.html");
?>

This is the code I use with the simple_html_dom.php module for HTML files prepared in MS Word:

<?php
/* Using simple_html_dom.php */

include("simple_html_dom.php");

$html = file_get_html("HTML File With Empty Paragraphs.html");

$pars = $html->find("p");

for ($i = 0; $i < count($pars); $i++) {
    if (trim($pars[$i]->plaintext) == "") {
        unset($pars[$i]);
        $i--;
    }
}

$html->save("HTML File without Empty Paragraphs.html");
?>

It is almost the same, except that the $pars variable is a DOMNodeList when using DOMDocument and an array when using simple_html_dom.php. But this code does not work. First it runs for two minutes, and then it reports these errors: "Undefined offset: 1" and "Trying to get property of non-object" for this line: "if (trim($pars[$i]->plaintext) == "") {".

Does anyone know how I can fix this?

Thank you.

I also asked on php devnetwork.

  • I guess the line if (trim($pars->item($i)->textContent == "")){ in the first code block you posted should be if (trim($pars->item($i)->textContent) == ""){ Commented Sep 18, 2010 at 9:16
  • ps: same in the second code block if (trim($pars[$i]->plaintext == "")) { => if (trim($pars[$i]->plaintext) == "") { ;) Commented Sep 18, 2010 at 9:17
  • @DaNiel, thanks for pointing that out, but after fixing it, I get the same results. Commented Sep 19, 2010 at 20:25

2 Answers


Looking at the documentation for Simple HTML DOM Parser, I think this should do the trick:

include('simple_html_dom.php');

$html = file_get_html('HTML File With Empty Paragraphs.html');
$pars = $html->find('p');

foreach($pars as $par)
{
    if(trim($par->plaintext) == '')
    {
        // To remove an element, set its outertext to an empty string.
        $par->outertext = '';
    }
}

$html->save('HTML File without Empty Paragraphs.html');

I did a quick test and this works for me:

include('simple_html_dom.php');

$html = str_get_html('<html><body><h1>Test</h1><p></p><p>Test</p></body></html>');
$pars = $html->find("p");

foreach($pars as $par)
{
    if(trim($par->plaintext) == '')
    {
        $par->outertext = '';
    }
}

echo $html;
// Output: <html><body><h1>Test</h1><p>Test</p></body></html>
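
As for why the loop in the question misbehaves, my reading is this: unset($pars[$i]) removes the array entry but does not renumber the remaining keys, so count($pars) shrinks while index $i stays a hole. After the $i-- and the loop's $i++, the code lands on the same missing index again, reads an undefined offset, treats the resulting null as an empty paragraph, and repeats, which would explain both the long run time and the two notices. Even with correct indexing, unsetting array entries only drops the references held in the PHP array and leaves the parsed document untouched, which is why the outertext approach is needed. A minimal sketch of the array behaviour, using a plain array of strings as a stand-in for the result of find() (the sample values are made up for illustration):

```php
<?php
// Stand-in for $html->find('p'): three "paragraphs", the middle one empty.
// (Hypothetical data; the real array would hold simple_html_dom node objects.)
$pars = ['first', '', 'third'];

// Remove the empty entry the same way the question's loop does.
unset($pars[1]);

// The array now has two elements, but the keys are NOT renumbered.
$count = count($pars);            // 2
$keys = array_keys($pars);        // keys 0 and 2 remain; 1 is a hole
$hasIndexOne = isset($pars[1]);   // false: reading $pars[1] raises "Undefined offset"

// array_values() is one way to get consecutive keys back if they are needed.
$reindexed = array_values($pars); // ['first', 'third']
```

With foreach, as in the code above, the keys never matter, so the problem does not arise.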

Empty paragraphs look like <p [attributes]> [spaces or newlines] </p> (case-insensitive). You can use preg_replace (or str_replace) to remove them.

The following will only work if an empty paragraph is <p></p>:

$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = str_replace('<p></p>', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);

This will also work on paragraphs with attributes, like <p class="msoNormal"> </p>:

$oldHtml = file_get_contents('File With Empty Paragraphs.html');
$newHtml = preg_replace('#<p[^>]*>\s*</p>#i', '', $oldHtml);
// and write the new HTML to the file
$fh = fopen('File Without Empty Paragraphs.html', 'w');
fwrite($fh, $newHtml);
fclose($fh);
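
If the files come from MS Word, the "empty" paragraphs may contain a &nbsp; entity rather than plain whitespace; that is an assumption on my part about the input, not something taken from the question. Under that assumption, a slightly extended variant of the pattern above could be wrapped in a small helper. The function name removeEmptyParagraphs is hypothetical, and the sketch does not handle paragraphs that contain only nested tags:

```php
<?php
/**
 * Remove <p> elements whose content is only whitespace and/or &nbsp; entities.
 * Hypothetical helper for illustration; regex-based, so it is not a full HTML parser.
 */
function removeEmptyParagraphs(string $html): string
{
    // \b keeps the pattern from matching other tags that start with "p", such as <pre>.
    // (?:\s|&nbsp;)* accepts any mix of whitespace and &nbsp; entities as "empty".
    $result = preg_replace('#<p\b[^>]*>(?:\s|&nbsp;)*</p>#i', '', $html);

    // preg_replace() returns null on error; fall back to the unchanged input.
    return $result ?? $html;
}

$sample = '<h1>T</h1><p class="MsoNormal"> </p><p>&nbsp;</p><p>Text</p>';
$cleaned = removeEmptyParagraphs($sample);
echo $cleaned; // <h1>T</h1><p>Text</p>
```

To apply it to a file, read the contents with file_get_contents(), pass them through the helper, and write the result back with file_put_contents().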
