
My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an if statement to check for keywords. It works, but it is very slow! The code can be found below. Is there a better way to go about this, and if so, how would it be written?

Thanks in advance.

<?php
require 'simple_html_dom.php';

// URL and CSS selector for the headlines
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug
$i = 0;

// Checking for the keyword "a"/"A" in the headlines
foreach($syds->find($syds_key) as $element) {
   // stripos is case-insensitive; plaintext limits the check to the link text
   if (stripos($element->plaintext, 'a') !== false) {
      echo $element->href . '<br>';
      $i++;
   }
} 

echo "<h1>$i were found</h1>";
?>
  • First, try to treat the HTML like an XML document (php.net/manual/en/book.simplexml.php). Or you can use a library like the Symfony DOM Crawler. Commented Jan 6, 2017 at 22:22
  • Most news sites have RSS feeds; they're much faster to process. Commented Jan 6, 2017 at 22:24
  • Didn't even think of the RSS feeds, haha, thanks! I will try it out. I will also check out the XML link :) Commented Jan 6, 2017 at 22:28
  • If I use an RSS feed as a source, it can't tell whether the match is in the headline or just some other word in the feed. I only want to search for the keyword in the headlines. (See the title-only sketch below.) Commented Jan 6, 2017 at 22:53
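
For the RSS route discussed in the comments, a minimal sketch (the feed URL is a placeholder and a standard RSS 2.0 structure is assumed): each item's <title> element is the headline, so the keyword check can be limited to titles.

<?php
// Sketch only: read an RSS feed and check the keyword against item titles.
// The feed URL below is a placeholder; assumes a standard RSS 2.0 feed.
$keyword = 'a';
$feed = simplexml_load_file('http://example.com/rss');

$i = 0;
foreach ($feed->channel->item as $item) {
    // in RSS, each item's <title> is the headline
    if (stripos((string) $item->title, $keyword) !== false) {
        echo (string) $item->link . '<br>';
        $i++;
    }
}
echo "<h1>$i headlines were found</h1>";
?>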

2 Answers


How slow are we talking?

1-2 seconds would be pretty good.

If you're using this for a website, I'd advise splitting the crawling and the display into two separate scripts, and caching the results of each crawl.

You could:

  • have a crawl.php file that runs periodically to update your links.
  • then have a webpage.php that reads the results of the last crawl and displays it however you need for your website.

This way:

  • Every time you refresh your webpage, it doesn't re-request info from the news site.
  • It matters less if the news site takes a little while to respond.

Decouple crawling/display

You will want to decouple crawling and display completely. Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated; be warned that at intervals under a minute some news sites may get annoyed!

crawler.php

<?php
// Run this file from cli every 5-10 minutes
// doesn't matter if it takes 20-30 seconds

require 'simple_html_dom.php';

$html_output = ""; // use this to build up html output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site){
   $url = $site[0];
   $key = $site[1];
   // fetch site
   $syds = file_get_html($url);

   // loop over each link
   foreach($syds->find($key) as $element) {
     // add link to $html_output
     $html_output .= $element->href . "<br>\n";
   }
}
// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>

display.php

<!-- other display stuff here -->
<?php
// include the file of links
include("links.php");
?>
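
If the keyword check from the question should be kept, one option (a sketch only, reusing the Simple HTML DOM calls and the links.php file name from above) is to filter inside the crawler loop before appending each link, so display.php stays a plain include:

<?php
// Variant of crawler.php that keeps the keyword check from the question.
require 'simple_html_dom.php';

$keyword = 'a';        // keyword from the original question
$html_output = '';

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    // array('URL', 'KEY')
);

foreach ($sites as $site) {
    $syds = file_get_html($site[0]);
    foreach ($syds->find($site[1]) as $element) {
        // only keep links whose visible headline text contains the keyword
        if (stripos($element->plaintext, $keyword) !== false) {
            $html_output .= $element->href . "<br>\n";
        }
    }
}

file_put_contents("links.php", $html_output);
?>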

Still want faster?

If you want it any faster, I'd suggest looking into Node.js; it's much faster at TCP connections and HTML parsing.


15 Comments

The more sources I add, the longer it takes, and it becomes very annoying. I'll look into splitting it up and using a crawler. Thanks!
I've added some more info on how to decouple the crawling and display into 2 scripts.
Thanks for your code. But could you edit it so it uses the code I've written earlier? I'm not sure which part goes where. :)
Do you want a $syds_key per site, or just 'a.newsday__title' for all?
Yes, one key per site, please.

The bottlenecks are:

  • Blocking I/O - you can switch to an asynchronous HTTP client like Guzzle and fetch all the sites concurrently (a sketch follows below)

  • Parsing - you can switch to a different parser for better parsing speed
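
A minimal sketch of the concurrent-fetch idea, assuming a recent Guzzle installed via Composer (its promise API is used here); the URL list mirrors the site list from the earlier answer, and the parsing step is left to whichever parser you choose:

<?php
// Sketch only: fetch several news sites concurrently with Guzzle.
require 'vendor/autoload.php'; // assumes Guzzle was installed via Composer

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$urls = array(
    'syds' => 'http://www.sydsvenskan.se/nyhetsdygnet'
    // more sites go here
);

$client = new Client(array('timeout' => 10));

// start all requests without waiting for each one to finish
$promises = array();
foreach ($urls as $name => $url) {
    $promises[$name] = $client->getAsync($url);
}

// settle() waits for all of them and doesn't throw if one site is down
$results = Utils::settle($promises)->wait();

foreach ($results as $name => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = (string) $result['value']->getBody();
        // hand $html to your parser of choice here
    }
}
?>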
