PHP Extract all text from html page

Question

I have been scratching my head over this for the past hour. Is there a reliable way to extract ONLY the text, and nothing else (code, images, links, styles, scripts), from an HTML page? I am trying to extract all the text inside the body of an HTML document.

This includes paragraphs, plain text, and tabular data.

So far I have tried the Simple HTML DOM parser and also file_get_contents, but neither of them is working. Here is the code:

<?php

require_once "simple_html_dom.php";

function getplaintextintrofromhtml($html) {

    // Remove the HTML tags
    $html = strip_tags($html);

    // Convert HTML entities to single characters
    $html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

    return $html;

}

$html = file_get_contents('http://www.thefreedictionary.com/contempt');

echo getplaintextintrofromhtml($html);
?>

Here is a screenshot of the output:

https://docs.google.com/file/d/0B-b63LoI1gSfaGhpR0NvdUtlbW8/edit?usp=drivesdk

As you can see, it displays garbled output and does not even show the whole page text.

What are you trying to extract? It's unclear. What should the final output be? The contents inside <head></head>?
– Kevin, Commented Nov 25, 2014 at 10:48
I'm not sure what you want to accomplish: if you get the text, you'd get all the menus, titles, etc. concatenated. I'd say make a small example with some input and the desired output, and explain what you mean by "not working". What is different from your expected output? Put some effort into both your project and your question. See: stackoverflow.com/help/mcve
– Nanne, Commented Nov 25, 2014 at 10:55
I've played around with this and found that some of the content is inside script tags, whose contents strip_tags does not remove. The other odd characters are not plain ASCII, so you can use the following to remove them as well: iconv('UTF-8', 'ASCII//IGNORE', $string). Hope that helps.
– w3shivers, Commented Nov 25, 2014 at 11:26
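
The point in the last comment, that strip_tags leaves the contents of script tags behind, can be worked around by parsing the page and dropping those elements before reading the text. The following is only a minimal sketch using PHP's bundled DOM extension. The helper name html_to_text() is made up for illustration, and the code assumes the input is UTF-8.

```php
<?php
// Sketch only: html_to_text() is a hypothetical helper, not part of any library.
function html_to_text(string $html): string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid; collect parser warnings instead of printing them.
    libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8. The XML prolog hints that to libxml.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    // Remove elements whose content is code rather than readable text.
    $xpath = new DOMXPath($doc);
    $nodes = iterator_to_array($xpath->query('//script | //style | //noscript'));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }

    // textContent returns the text of all descendants with entities decoded.
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body ? $body->textContent : '';

    // Collapse the whitespace left behind by the removed markup.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo html_to_text('<p>Hello <b>world</b></p><script>alert(1)</script>'), "\n"; // Hello world
```

Note that textContent joins neighbouring nodes without adding separators, so adjacent table cells can run together. If that matters, the text nodes would need to be walked and joined with a space instead.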

Kevin · Accepted Answer · 2014-11-25 10:55:49Z

2

I don't know why you think SimpleHTMLDOM doesn't work; you just have to use it properly. Target the body, then read its ->plaintext property (->innertext would return the inner HTML, tags included):

require_once 'simple_html_dom.php';

function getplaintextintrofromhtml($url) {
    $html = file_get_html($url);
    if (!$html) {
        return ''; // the page could not be downloaded or parsed
    }

    // point to the body, then get its text without the tags
    $body = $html->find('body', 0);
    return $body ? $body->plaintext : '';
}

echo getplaintextintrofromhtml('http://www.thefreedictionary.com/contempt');


srinath madusanka · Accepted Answer · 2014-11-25 10:52:29Z

1

I think PHP Simple HTML DOM Parser is the quickest and easiest way to do that. Try http://simplehtmldom.sourceforge.net/

Features:
An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way.
Requires PHP 5+.
Supports invalid HTML.
Finds tags on an HTML page with selectors, just like jQuery.
Extracts contents from HTML in a single line.


Paulius Jacionis · Accepted Answer · 2017-03-27 10:18:52Z

0

Html2Text is a good library just for that.

https://github.com/mtibben/html2text

Install using composer:

composer require html2text/html2text

Basic usage:

$html = new \Html2Text\Html2Text('Hello, &quot;<b>world</b>&quot;');

echo $html->getText();  // Hello, "WORLD"
