PHP Simple HTML DOM Parser: Get all posts

Question

I'd like to get all the articles from a webpage, along with all the pictures in each article.

I decided to use PHP Simple HTML DOM Parser and wrote the following code:

<?php

include("simple_html_dom.php");

$sitesToCheck = array(
    array(
        'url' => 'http://googleblog.blogspot.ru/',
        'search_element' => 'h2.title a',
        'get_element' => 'div.post-content'
    ),
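    // Template for additional sites (kept commented out so the loop never sees an empty entry):
    // array(
    //     'url' => '',            // Site address with a list of articles
    //     'search_element' => '', // Selector for the article links on the site
    //     'get_element' => ''     // Selector for the desired content
    // ),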
);

$s = microtime(true);

foreach($sitesToCheck as $site)
{
    $html = file_get_html($site['url']);

    foreach($html->find($site['search_element']) as $link)
    {
        $content   = '';
        $savePath  = 'cachedPages/'.md5($site['url']).'/';
        $fileName  = md5($link->href);

        if ( ! file_exists($savePath.$fileName))
        {
            $post_for_scan = file_get_html($link->href);

            foreach($post_for_scan->find($site["get_element"]) as $element)
            {
                $content .= $element->plaintext . PHP_EOL;
            }

            if ( ! file_exists($savePath) && ! mkdir($savePath, 0755, true))
            {
                die('Unable to create directory ...');
            }

            file_put_contents($savePath.$fileName, $content);
        }
    }
}

$e = microtime(true);

echo $e-$s;

For now I am only trying to fetch the article text, without pictures, but the server responds with:

"Maximum execution time of 120 seconds exceeded"

What am I doing wrong? Is there another way to get all the articles, and all the pictures for each article, from a specific webpage?

So much for the "simple" part, eh. :) Seriously, though, last time I checked it (a few months ago), simple_html_dom was still a steaming pile. DOMDocument + DOMXPath took about 1/5 of the space and time. Literally: I cut my memory usage and run time by 80% by getting rid of it.
– cHao, Commented Nov 27, 2013 at 14:53
You shouldn't rely on this too much, but if you know beforehand that a process is going to take a long time, try set_time_limit(0). It's not good practice to use it everywhere, though. It prevents PHP from killing your process when it exceeds the maximum execution time (120 s in your case), so it will run until it finishes. The problem is that if a mistake in your program causes it to run forever, it will sit on the server consuming resources until someone intervenes manually.
– ILikeTacos, Commented Nov 27, 2013 at 15:02
Just so I don't sound like a rabid hater, there is one thing simple_html_dom might be good for. If you have HTML that's mangled so badly that it no longer looks like HTML, DOMDocument might not handle it well. A lib like simple_html_dom might do better with such garbage, as it's designed to work with wacky markup. But it's rare to have to parse a document that's so horribly broken that DOMDocument can't handle it. At least, I've never had to deal with it.
– cHao, Commented Nov 27, 2013 at 15:35
@cHao Yes, unlike loading XML, HTML does not have to be well-formed to load.
– serghei, Commented Nov 27, 2013 at 15:49
Right...so any even-pseudo-competent parser will more-or-less work. A parser like simple_html_dom, that's specifically intended to work with malformed HTML, might have a better chance of seeing the document as the author intended. But the HTML would have to be quite broken in order for a more forgiving parser to justify the resource/performance hit.
– cHao, Commented Nov 27, 2013 at 17:36
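
The set_time_limit() suggestion from the comments could be sketched roughly as follows. This is only an illustration: the 300-second budget and the variable names are assumptions, and a finite limit is used instead of 0 so that a runaway loop is still stopped eventually.

```php
<?php
// Assumed budget for the whole scrape; 300 seconds is an illustrative value only.
$limitSeconds = 300;

// set_time_limit() restarts the timeout counter from zero and
// returns false when the limit cannot be changed.
$raised = set_time_limit($limitSeconds);

if (!$raised) {
    echo 'Could not change the execution time limit', PHP_EOL;
}

// The new value is visible through the max_execution_time ini setting.
echo 'max_execution_time is now ', ini_get('max_execution_time'), PHP_EOL;
```

Raising the limit only hides the symptom. If each page takes several seconds to download and parse, caching results or switching to a faster parser is likely to help more.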

svidgen · Accepted Answer · 2013-11-27 14:59:03Z

I had similar problems with that library. Use PHP's built-in DOMDocument instead (here $html is the page's HTML as a string):

$doc = new DOMDocument;
$doc->loadHTML($html);
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
  doSomethingWith($link->getAttribute('href'), $link->nodeValue);
}

See http://www.php.net/manual/en/domdocument.getelementsbytagname.php
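
A self-contained variant of the snippet above might look like the sketch below. The sample markup and the $found array are made up for illustration. In a real script the string would come from the downloaded page. Wrapping loadHTML() in libxml_use_internal_errors() is an addition not present in the answer; it keeps warnings about sloppy markup from being printed.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page.
$html = '<html><body>'
      . '<h2 class="title"><a href="/post-1">First post</a></h2>'
      . '<h2 class="title"><a href="/post-2">Second post</a></h2>'
      . '</body></html>';

// Collect libxml warnings internally instead of emitting them;
// real-world HTML is rarely clean.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Gather the href and link text of every <a> element.
$found = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $found[] = [
        'href' => $link->getAttribute('href'),
        'text' => trim($link->nodeValue),
    ];
}

print_r($found);
```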


serghei Over a year ago

Thank you. Now I need to work out how to select items using queries like div.post-content, table.wrapper td.content, or div p a.
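
DOMDocument has no CSS selector support of its own, but selectors like the ones in this comment can usually be approximated with DOMXPath. The sketch below is one possible translation, not part of the original answer: hasClass() is a hypothetical helper and the sample markup is invented for the example.

```php
<?php
// Hypothetical helper: builds an XPath predicate that matches one class token,
// roughly what ".name" means in CSS.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Invented sample markup for the demonstration.
$html = '<html><body>'
      . '<div class="post-content main"><p>Hello <a href="/a">link A</a></p></div>'
      . '<table class="wrapper"><tr><td class="content">Cell text</td></tr></table>'
      . '<div class="post-content-extra">not matched</div>'
      . '</body></html>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// CSS: div.post-content
$posts = $xpath->query('//div[' . hasClass('post-content') . ']');

// CSS: table.wrapper td.content
$cells = $xpath->query(
    '//table[' . hasClass('wrapper') . ']//td[' . hasClass('content') . ']'
);

// CSS: div p a
$links = $xpath->query('//div//p//a');

foreach ($posts as $post) {
    echo trim($post->textContent), PHP_EOL;
}
foreach ($cells as $cell) {
    echo trim($cell->textContent), PHP_EOL;
}
foreach ($links as $link) {
    echo $link->getAttribute('href'), PHP_EOL;
}
```

The concat/normalize-space predicate matches whole class tokens, so class="post-content main" is selected while class="post-content-extra" is not.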
