
I'd like to get all the articles from a webpage, as well as all the pictures for each article.

I decided to use PHP Simple HTML DOM Parser, with the following code:

<?php

include("simple_html_dom.php");

$sitesToCheck = array(
    array(
        'url' => 'http://googleblog.blogspot.ru/',
        'search_element' => 'h2.title a',
        'get_element' => 'div.post-content'
    ),
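    // Template for additional sites. It stays commented out as a whole so
    // the loop below never processes an entry that has no 'url'.
    // array(
    //     'url' => '',            // Site address with a list of articles
    //     'search_element' => '', // Selector for the article links on the site
    //     'get_element' => ''     // Selector for the desired content
    // )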
);

$s = microtime(true);

foreach($sitesToCheck as $site)
{
    $html = file_get_html($site['url']);

    foreach($html->find($site['search_element']) as $link)
    {
        $content   = '';
        $savePath  = 'cachedPages/'.md5($site['url']).'/';
        $fileName  = md5($link->href);

        if ( ! file_exists($savePath.$fileName))
        {
            $post_for_scan = file_get_html($link->href);

            foreach($post_for_scan->find($site["get_element"]) as $element)
            {
                $content .= $element->plaintext . PHP_EOL;
            }

            // Mode 0755 rather than 0: a directory created with mode 0 cannot be written to
            if ( ! file_exists($savePath) && ! mkdir($savePath, 0755, true))
            {
                die('Unable to create directory ...');
            }

            file_put_contents($savePath.$fileName, $content);
        }
    }
}

$e = microtime(true);

echo $e-$s;

For now I am trying to fetch only the articles, without pictures, but the server responds with:

"Maximum execution time of 120 seconds exceeded"

What am I doing wrong? Is there any other way to get all the articles, and all the pictures for each article, from a specific webpage?

  • So much for the "simple" part, eh. :) Seriously, though, last time I checked it (a few months ago), simple_html_dom was still a steaming pile. DOMDocument + DOMXPath took like 1/5 of the space and time. Literally. I cut my memory usage and run time by 80% by getting rid of it. Commented Nov 27, 2013 at 14:53
  • You shouldn't rely on this too much, but if you know beforehand that a process is going to take a long time, try set_time_limit(0). It's not good practice to use it everywhere, though. It prevents PHP from killing your process when it exceeds the maximum execution time (120 s in your case), so it runs until it finishes. The problem is that if a mistake in your program makes it run forever, it will sit on the server consuming resources until someone intervenes manually. Commented Nov 27, 2013 at 15:02
  • Just so I don't sound like a rabid hater, there is one thing simple_html_dom might be good for. If you have HTML that's mangled so badly that it no longer looks like HTML, DOMDocument might not handle it well. A lib like simple_html_dom might do better with such garbage, as it's designed to work with wacky markup. But it's rare to have to parse a document that's so horribly broken that DOMDocument can't handle it. At least, I've never had to deal with it. Commented Nov 27, 2013 at 15:35
  • @cHao Yes, unlike loading XML, HTML does not have to be well-formed to load. Commented Nov 27, 2013 at 15:49
  • Right...so any even-pseudo-competent parser will more-or-less work. A parser like simple_html_dom, that's specifically intended to work with malformed HTML, might have a better chance of seeing the document as the author intended. But the HTML would have to be quite broken in order for a more forgiving parser to justify the resource/performance hit. Commented Nov 27, 2013 at 17:36
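
The set_time_limit() suggestion from the comments can be sketched as follows. To limit the risk that the same comment warns about, this version sets a finite ceiling instead of 0. The 300-second value and the variable names are assumptions chosen for illustration, not something from the original post.

```php
<?php
// Assumption: 300 seconds is an arbitrary ceiling picked for this sketch.
$limitSeconds = 300;

// set_time_limit() restarts the timeout counter from the moment it is called
// and returns false if the limit could not be changed.
$changed = set_time_limit($limitSeconds);

echo $changed
    ? 'Execution limit is now ' . ini_get('max_execution_time') . ' s' . PHP_EOL
    : 'Could not change the execution limit' . PHP_EOL;
```

A finite limit still lets a long crawl finish, while a script stuck in an endless loop is eventually stopped without manual action.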

1 Answer


I had similar problems with that lib. Use PHP's DOMDocument instead:

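// $html must be a string of markup, e.g. the result of file_get_contents($url)
libxml_use_internal_errors(true); // collect warnings about invalid markup instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($html);
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
  // doSomethingWith() is a placeholder for your own handler
  doSomethingWith($link->getAttribute('href'), $link->nodeValue);
}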

See http://www.php.net/manual/en/domdocument.getelementsbytagname.php
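
The question also asks for the pictures of each article. The same approach might be extended to collect image URLs, as in the minimal sketch below. The markup in $html is a made-up stand-in for one downloaded article page, and $imageUrls is a name introduced only for this example.

```php
<?php
// Hypothetical markup standing in for one downloaded article page.
$html = '<div class="post-content"><p>Hello</p>'
      . '<img src="/img/a.png" alt="A"><img src="http://example.com/b.jpg"></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the src attribute of every <img> element in the document.
$imageUrls = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $imageUrls[] = $img->getAttribute('src');
}

print_r($imageUrls);
```

Relative values such as /img/a.png still have to be resolved against the address of the page before the files can be downloaded.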


1 Comment

Thank you. Now I need to work out how to select elements using queries like div.post-content, table.wrapper td.content or div p a, etc.
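
One possible way to handle such selectors is to translate them into XPath by hand and run them through DOMXPath, as sketched below. The sample markup and the hasClass() helper are hypothetical and exist only for this illustration. DOMXPath itself does not accept CSS selectors.

```php
<?php
// Hypothetical markup for illustration only.
$html = '<div class="post-content main"><p>Text <a href="/one">one</a></p></div>'
      . '<table class="wrapper"><tr><td class="content">Cell</td></tr></table>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: builds an XPath test that matches one class name
// inside a space-separated class attribute, like ".name" does in CSS.
function hasClass($name) {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// CSS selector => hand-written XPath equivalent.
$queries = array(
    'div.post-content'         => '//div[' . hasClass('post-content') . ']',
    'table.wrapper td.content' => '//table[' . hasClass('wrapper') . ']//td[' . hasClass('content') . ']',
    'div p a'                  => '//div//p//a',
);

$counts = array();
foreach ($queries as $css => $query) {
    $nodes = $xpath->query($query);
    $counts[$css] = $nodes->length;
    foreach ($nodes as $node) {
        echo $css . ' => ' . trim($node->textContent) . PHP_EOL;
    }
}
```

The concat/normalize-space test is used instead of @class="post-content" so that elements carrying several classes still match. The // step stands in for the CSS descendant combinator (a space).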
