How slow are we talking?
1-2 seconds would be pretty good.
If you're using this for a website, I'd advise splitting the crawling and the display into two separate scripts, and caching the results of each crawl.
You could:
- have a crawl.php file that runs periodically to update your links.
- then have a webpage.php that reads the results of the last crawl and displays them however you need for your website.
This way:
- Every time you refresh your webpage, it doesn't re-request info from the news site.
- It matters much less if the news site is slow to respond.
Decouple crawling/display
You will want to decouple crawling and display 100%.
Have a crawler.php that runs over all the news sites one at a time, saving the raw links to a file. It can run every 5-10 minutes to keep the news updated; be warned, though, that polling more often than once a minute may annoy some news sites!
crawler.php
<?php
// Run this file from the CLI every 5-10 minutes;
// it doesn't matter if it takes 20-30 seconds.
require 'simple_html_dom.php';

$html_output = ""; // used to build up the HTML output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'SELECTOR')
);

// loop over each site
foreach ($sites as $site) {
    $url = $site[0];
    $selector = $site[1];

    // fetch the site; skip it if the request fails
    $html = file_get_html($url);
    if (!$html) {
        continue;
    }

    // loop over each matching link
    foreach ($html->find($selector) as $element) {
        // add the link to $html_output
        // (double quotes so \n is a real newline)
        $html_output .= $element->href . "<br>\n";
    }
}

// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>
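To schedule it, a crontab entry along these lines would run the crawl every 5 minutes (the PHP binary path and the script path are assumptions; adjust them for your setup):

*/5 * * * * /usr/bin/php /path/to/crawler.php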
display.php
<?php
// include the file of links from the last crawl
include("links.php");
?>
<!-- other display stuff goes here -->
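If you want the page to show how fresh the cache is, here's a minimal sketch (assuming links.php sits next to display.php) that checks the file's modification time before including it:

<?php
// a minimal sketch: show when the last crawl ran
// (assumes links.php is in the same directory)
$cache = __DIR__ . "/links.php";
if (file_exists($cache)) {
    $age = time() - filemtime($cache);
    echo "<p>Links updated " . $age . " seconds ago</p>\n";
    include($cache);
} else {
    echo "<p>No crawl results yet.</p>\n";
}
?>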
Still want faster?
If you want it any faster, I'd suggest looking into Node.js; its non-blocking I/O handles many concurrent TCP connections well, so you could fetch all the sites in parallel instead of one at a time.