PHP Parse content from url

Question

I need some help with a study script I'm building that fetches articles from a website.

Currently I'm able to get the article from one element, but I'm failing to get all of them. This is an example of the markup on the page I'm trying to fetch:

<div class="entry-content">
</div>

<div class="entry-content">
</div>

<div class="entry-content">
</div>

This is my PHP code to get the content of the first div:

function getArticle($url){

    $content = file_get_contents($url);
    $first_step = explode( '<div class="entry-content">' , $content );
    $separate_news = explode("</div>" , $first_step[1] );
    $article = $separate_news[0];

    echo $article;

}

Nick · Accepted Answer · 2018-12-07 20:55:40Z

2

You should really use PHP's DOMDocument class for parsing HTML. In terms of your example code, the problem is that you only use $first_step[1] and never process the remaining entries of your $first_step array. You could try something like this:

$first_steps = explode( '<div class="entry-content">' , $content );
foreach ($first_steps as $first_step) {
    if (strpos($first_step, '</div>') === false) continue;
    $separate_news = explode("</div>" , $first_step );
    $article = $separate_news[0];
    echo $article;
}

Here's a small demo on 3v4l.org
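
As a self-contained sketch of the same idea, the loop can be wrapped in a function that returns the articles instead of echoing them. The function name extractEntryContents and the sample markup are assumptions made for illustration, not part of the original code. Unlike the loop above, this version skips the chunk before the first opening tag outright, and it still cuts an article short at the first nested </div>:

```php
<?php
// Hypothetical helper: collects the text between each
// '<div class="entry-content">' and the next '</div>'.
function extractEntryContents(string $content): array
{
    $articles = [];
    $chunks = explode('<div class="entry-content">', $content);

    // The first chunk is whatever precedes the first opening tag, so skip it.
    foreach (array_slice($chunks, 1) as $chunk) {
        $end = strpos($chunk, '</div>');
        if ($end === false) {
            continue; // no closing tag in this chunk; ignore it
        }
        $articles[] = trim(substr($chunk, 0, $end));
    }

    return $articles;
}

// Sample markup standing in for file_get_contents($url).
$sample = '<div class="entry-content">First</div>'
        . '<p>ignored</p>'
        . '<div class="entry-content">Second</div>';

print_r(extractEntryContents($sample));
```

Returning an array rather than echoing makes it easier to count, filter, or store the articles afterwards.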

1 Comment

Goncalo Over a year ago

Amazing work you did here! I can clearly see it now: my code ended up stuck on the first result. Regarding DOMDocument, I tried it before in vanilla PHP, but it seems it does not support selecting by class, only by id and element tag.

Kubwimana Adrien · Accepted Answer · 2018-12-07 21:03:23Z

1

I have used this library before: http://simplehtmldom.sourceforge.net/ (full documentation: http://simplehtmldom.sourceforge.net/manual.htm). It's very easy to use and does a lot more. You could select your articles like:

$html = file_get_html($url);
$articles = $html->find(".entry-content");
foreach($articles as $article) echo $article->plaintext;

2 Comments

Goncalo Over a year ago

Your code does not work with the latest version of the library: [07-Dec-2018 16:10:07 America/New_York] PHP Fatal error: Call to undefined function file_get_html() in /home/gmtemhic/public_html/index.php on line 19

Kubwimana Adrien Over a year ago

It should work. Looks like the library is not included on your page. Download it here sourceforge.net/projects/simplehtmldom/files/… and include simple_html_dom.php on your page.

trincot · Accepted Answer · 2018-12-07 21:05:48Z

1

You should use DOMDocument. Although it is a bit tricky to select nodes by CSS class, you can do it with DomXPath like this:

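$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep parser warnings quiet
$dom->loadHTMLFile($url);         // load() expects XML; loadHTMLFile() parses HTML
$xpath = new DOMXPath($dom);
$classname = "entry-content";
// Match elements whose class attribute contains $classname as a whole word
$nodes = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " ' . $classname . ' ")]');
foreach ($nodes as $node) {
    echo $node->textContent . "\n";
}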

An added advantage is that HTML entities and other markup inside the article content are converted as expected: &amp; becomes &, and <b>bold</b> becomes just bold.
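
To see that conversion without fetching a real page, here is a minimal sketch that parses an inline string with loadHTML() and collects textContent for each matching node. The sample markup and the $texts variable are assumptions made for illustration:

```php
<?php
// Sample markup standing in for a fetched page (an assumption for illustration).
$html = '<html><body>'
      . '<div class="entry-content">Fish &amp; <b>chips</b></div>'
      . '<div class="sidebar">ignored</div>'
      . '<div class="post entry-content">Second</div>'
      . '</body></html>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Select every element whose class list contains "entry-content" as a whole word.
$nodes = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]'
);

$texts = [];
foreach ($nodes as $node) {
    // textContent decodes entities and drops the tags of child elements.
    $texts[] = $node->textContent;
}

print_r($texts);
```

The second div is matched even though it carries two classes, while the sidebar div is skipped, which is what the concat/normalize-space expression is for.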
