I have been scratching my head over this for the past hour. Is there any reliable way to extract ONLY the text
and nothing else (code, images, links, styles, scripts) from an HTML page? I am trying to extract all the text inside the body of an HTML document.
This includes paragraphs, plain text, and tabular data.
So far I have tried the simplehtmldom parser and also file_get_contents, but neither of them is working. Here is the code:
<?php
require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {
    // Remove the HTML tags
    $html = strip_tags($html);
    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
    return $html;
}

$html = file_get_contents('http://www.thefreedictionary.com/contempt');
echo getplaintextintrofromhtml($html);
?>
Here is a screenshot of the output:
https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk
As you can see, it displays weird output and does not even display the whole page's text.
<head></head>? strip_tags: the other characters are ASCII and not UTF-8, so use the following to remove those characters as well: iconv('UTF-8', 'ASCII//IGNORE', $string). Hope that helps.
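One reason strip_tags gives odd output is that it removes only the tags themselves, so the contents of script and style elements stay in the result as if they were text. A minimal sketch of an alternative, assuming PHP's bundled DOM extension is available and the page is UTF-8 encoded: parse the document, then collect only the text nodes under body that do not sit inside script, style, or noscript. The function name extractBodyText and the sample markup are hypothetical, made up for illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns the readable
// text found inside <body>, skipping script/style/noscript contents.
function extractBodyText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML encoding hint asks libxml to read the input as UTF-8
    // (an assumption about the page, adjust if the site uses another charset).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Text nodes under <body> that are not inside code-like elements.
    $nodes = $xpath->query(
        '//body//text()[not(ancestor::script or ancestor::style or ancestor::noscript)]'
    );

    $pieces = [];
    foreach ($nodes as $node) {
        // Collapse internal whitespace and drop nodes that are only whitespace.
        $text = trim(preg_replace('/\s+/u', ' ', $node->nodeValue));
        if ($text !== '') {
            $pieces[] = $text;
        }
    }
    // Join with spaces so adjacent table cells do not run together.
    return implode(' ', $pieces);
}

// Made-up sample input for illustration.
$sample = '<html><head><title>T</title><style>p{color:red}</style></head>'
        . '<body><script>var x = 1;</script><p>Hello &amp; welcome</p>'
        . '<table><tr><td>A</td><td>B</td></tr></table></body></html>';
echo extractBodyText($sample), "\n";
```

For a live page, the result of file_get_contents can be passed to the same function. Entities such as &amp; are decoded by the parser, so a separate html_entity_decode call should not be needed. Text that the site injects with JavaScript after loading will not appear, since nothing here executes scripts.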