I need some help with a study script I'm building, which tries to fetch articles from a website.

Currently I'm able to get the article from one element, but I'm failing to get all of them. This is an example of the HTML at the URL I'm trying to fetch:

<div class="entry-content">
</div>

<div class="entry-content">
</div>

<div class="entry-content">
</div>

This is my PHP code to get the content of the first div:

function getArticle($url){

    $content = file_get_contents($url);
    $first_step = explode( '<div class="entry-content">' , $content );
    $separate_news = explode("</div>" , $first_step[1] );
    $article = $separate_news[0];

    echo $article;

}

3 Answers

You should really use PHP's DOMDocument class for parsing HTML. As for your example code, the problem is that you're not processing all the results in your $first_step array: you only ever read $first_step[1]. You could try something like this:

$first_steps = explode( '<div class="entry-content">' , $content );
foreach ($first_steps as $first_step) {
    if (strpos($first_step, '</div>') === false) continue;
    $separate_news = explode("</div>" , $first_step );
    $article = $separate_news[0];
    echo $article;
}

Here's a small demo on 3v4l.org
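
To make the effect of the loop easier to see, here is a self-contained sketch that runs the same explode approach on an inline HTML string instead of a fetched page. The function name extractEntries and the sample markup are invented for illustration. Unlike the snippet above, it always skips the first chunk, on the assumption that whatever precedes the first matching div on a real page may contain unrelated closing div tags.

```php
<?php
// Hypothetical helper: returns the inner text of every
// <div class="entry-content"> block found in $content.
function extractEntries(string $content): array
{
    $chunks = explode('<div class="entry-content">', $content);
    $articles = [];
    // $chunks[0] is everything before the first matching div, so skip it.
    foreach (array_slice($chunks, 1) as $chunk) {
        // Keep only the text up to the first closing tag.
        $parts = explode('</div>', $chunk);
        $articles[] = trim($parts[0]);
    }
    return $articles;
}

// Sample markup, invented for this example.
$html = '<div id="header">Menu</div>'
      . '<div class="entry-content">First</div>'
      . '<div class="entry-content">Second</div>';

print_r(extractEntries($html));
```

This still cuts an article short if it contains a nested div, which is one reason a real HTML parser is the safer choice.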

1 Comment

Amazing work! I can see it clearly now: my code got stuck on the first result. As for DOMDocument, I tried it before with plain PHP, but it seems to support selecting only by id and tag name, not by class.

I have used this library before: http://simplehtmldom.sourceforge.net/. The full documentation is at http://simplehtmldom.sourceforge.net/manual.htm. It's very easy to use and does a lot more. You could select your articles like this:

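// file_get_html() comes from the library; adjust the path to where you saved it
require_once 'simple_html_dom.php';

$html = file_get_html($url);
$articles = $html->find(".entry-content");
foreach ($articles as $article) echo $article->plaintext;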

2 Comments

Your code does not work with the latest version of the library: [07-Dec-2018 16:10:07 America/New_York] PHP Fatal error: Call to undefined function file_get_html() in /home/gmtemhic/public_html/index.php on line 19
It should work. It looks like the library is not included on your page. Download it from sourceforge.net/projects/simplehtmldom/files/… and include simple_html_dom.php on your page.

You should use DOMDocument. Although selecting nodes by CSS class is a bit tricky, you can do it with DOMXPath like this:

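$dom = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of printing them
libxml_use_internal_errors(true);
// load() expects well-formed XML; loadHTMLFile() parses HTML
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$classname = "entry-content";
// Match elements whose class attribute contains $classname as a whole word
$nodes = $xpath->query('//*[contains(concat(" ", normalize-space(@class), " "), " ' . $classname . ' ")]');
foreach ($nodes as $node) {
    echo $node->textContent . "\n";
}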

Another advantage is that HTML entities and any markup inside the article content are converted as expected: &amp; becomes &, and <b>bold</b> becomes just bold.
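
As a rough illustration of that conversion, the sketch below parses an inline HTML string rather than a URL, so it can be run without network access. The sample markup and variable names are made up for this example. It uses loadHTML() because the input is a string, and it includes a div whose class merely starts with "entry-content" to show that the XPath expression matches whole class names only.

```php
<?php
// Sample markup, invented for this example.
$html = '<html><body>'
      . '<div class="entry-content">Fish &amp; <b>chips</b></div>'
      . '<div class="entry-content extra">Second</div>'
      . '<div class="entry-content-footer">Ignored</div>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Pad the class attribute with spaces so only whole class names match.
$query = '//*[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]';

$texts = [];
foreach ($xpath->query($query) as $node) {
    // textContent decodes entities and drops the tags of child elements.
    $texts[] = $node->textContent;
}

print_r($texts);
```

The first entry comes back as "Fish & chips", with the entity decoded and the b tag gone, and the "entry-content-footer" div is not matched.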
