2

I have been scratching my head over this for the past hour. Is there a reliable way to extract ONLY the text, and nothing else (code, images, links, styles, scripts), from an HTML page? I am trying to extract all the text inside the body of an HTML document.

This includes paragraphs, plain text and tabular data.

So far I have tried the Simple HTML DOM parser and also file_get_contents, but neither of them works. Here is the code:

<?php

require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {

    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    return $html;

}

$html = file_get_contents('http://www.thefreedictionary.com/contempt');

echo getplaintextintrofromhtml($html);
?>

Here is a screenshot of the output:

https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk

As you can see, it displays weird output and does not even show the whole page text.

4
  • php.net/manual/en/book.curl.php and strip_tags() Commented Nov 25, 2014 at 10:42
  • What are you trying to extract? It's unclear. What should the final output be? The contents inside <head></head>? Commented Nov 25, 2014 at 10:48
  • I'm not sure what you want to accomplish: if you get the text, you'd get all the menus, titles etc. concatenated? I'd say, make a small example with some input and the desired output, and explain what you mean by "not working". What is different from your expected output? Put some effort into both your project and your question. See: stackoverflow.com/help/mcve Commented Nov 25, 2014 at 10:55
  • 1
    I've played around with this and found that some of the output comes from the contents of script tags, which strip_tags leaves in place. The other odd characters are not plain ASCII, so you can use the following to remove them as well: iconv('UTF-8', 'ASCII//IGNORE', $string). Hope that helps. Commented Nov 25, 2014 at 11:26
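
To illustrate the point about script contents surviving strip_tags(), here is a minimal sketch that drops script and style elements (including what is inside them) before stripping the remaining tags. The function name html_to_plain_text and the sample markup are made up for this example, and the regex is only a rough filter that can be fooled by unusual markup, so treat it as a starting point rather than a robust solution.

```php
<?php
// Hypothetical helper (not from the original post): plain text via strip_tags().
function html_to_plain_text(string $html): string
{
    // strip_tags() removes the <script>/<style> tags but keeps the code between
    // them, so remove those elements together with their contents first.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1\s*>#is', ' ', $html) ?? $html;

    // Put a space before every tag so text from neighbouring elements
    // (for example table cells) does not get glued together.
    $text = strip_tags(str_replace('<', ' <', $html));

    // Convert HTML entities to single characters.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Collapse runs of whitespace; fall back to the raw text if the
    // input is not valid UTF-8 and the /u pattern fails.
    return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
}

// Made-up sample input for illustration.
$sample = '<html><head><style>p{color:red}</style></head><body>'
        . '<p>Hello &amp; welcome</p><script>var x = 1;</script>'
        . '<table><tr><td>A</td><td>B</td></tr></table></body></html>';

echo html_to_plain_text($sample), PHP_EOL; // Hello & welcome A B
```

For a real page you would pass in the string returned by file_get_contents() instead of $sample.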

3 Answers

2

I don't know why you think Simple HTML DOM doesn't work; you just have to use it properly. Target the body, then read its ->plaintext property (->innertext would return the inner HTML, tags included):

function getplaintextintrofromhtml($url) {
    // require_once avoids redeclaration errors if the function is called twice
    require_once 'simple_html_dom.php';

    $html = file_get_html($url);
    // point to the body, then get its text without any tags
    $data = $html->find('body', 0)->plaintext;
    return $data;
}

echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');
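
If adding a library is not an option, the same idea (target the body, then take only its text) can be sketched with PHP's bundled DOM extension. This is an alternative technique, not what Simple HTML DOM does internally. The function name body_text_from_html and the sample markup are made up for the example. One caveat: DOMDocument::loadHTML() assumes ISO-8859-1 unless the page declares its charset, so non-ASCII pages may need extra handling.

```php
<?php
// Hypothetical helper (not from the original answer); uses only ext-dom.
function body_text_from_html(string $html): string
{
    $doc = new DOMDocument();

    // Real pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Remove elements whose contents are code rather than readable text.
    // Copy the node list to an array first, because we modify the tree.
    foreach (iterator_to_array($xpath->query('//script | //style | //noscript')) as $node) {
        $node->parentNode->removeChild($node);
    }

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Collect text nodes one by one so paragraphs and table cells stay separated.
    $parts = [];
    foreach ($xpath->query('.//text()', $body) as $textNode) {
        $chunk = trim($textNode->nodeValue);
        if ($chunk !== '') {
            $parts[] = $chunk;
        }
    }

    return implode(' ', $parts);
}

// Made-up sample input for illustration.
$sample = '<html><head><title>T</title><style>p{}</style></head><body>'
        . '<p>First paragraph.</p><script>alert(1);</script>'
        . '<table><tr><td>cell 1</td><td>cell 2</td></tr></table></body></html>';

echo body_text_from_html($sample), PHP_EOL; // First paragraph. cell 1 cell 2
```

Joining the text nodes with a space keeps table cells apart, at the cost of an extra space around inline tags such as <b>.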

1

I think PHP Simple HTML DOM Parser is the quickest and easiest way to do that. Try http://simplehtmldom.sourceforge.net/

Features:
An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way.
Requires PHP 5+.
Supports invalid HTML.
Finds tags on an HTML page with selectors, just like jQuery.
Extracts contents from HTML in a single line.


0

Html2Text is a good library just for that.

https://github.com/mtibben/html2text

Install using composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"
