
My goal is to collect headlines from different news outlets and then echo them on my page. I've tried using Simple HTML DOM and then running an if statement to check for keywords. It works, but it is very slow! The code can be found below. Is there a better way to go about this, and if so, how would it be written?

Thanks in advance.

<?php
require 'simple_html_dom.php';

// URL and CSS selector for the headlines
$syds = file_get_html('http://www.sydsvenskan.se/nyhetsdygnet');
$syds_key = 'a.newsday__title';

// Debug
$i = 0;

// Checking for the keyword "a"/"A" in the headlines
foreach($syds->find($syds_key) as $element) {
   // stripos is case-insensitive; plaintext limits the check to the link text
   if (stripos($element->plaintext, 'a') !== false) {
      echo $element->href . '<br>';
      $i++;
   }
} 

echo "<h1>$i were found</h1>";
?>
  • First, try to treat the HTML like an XML document (php.net/manual/en/book.simplexml.php). Or you can use a library like the Symfony DOM Crawler. Commented Jan 6, 2017 at 22:22
  • Most news sites have RSS feeds; they're much faster to process. Commented Jan 6, 2017 at 22:24
  • Didn't even think of the RSS feeds, haha, thanks! I will try it out. I will also check out the XML link :) Commented Jan 6, 2017 at 22:28
  • If I use an RSS feed as a source, it can't tell whether the match is in the headline or just some other word in the feed. I only want to search for the keyword in the headlines. (See the title-only sketch below.) Commented Jan 6, 2017 at 22:53
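
For the RSS route discussed in the comments, a minimal sketch (the feed URL is a placeholder and a standard RSS 2.0 structure is assumed): each item's <title> element is the headline, so the keyword check can be limited to titles.

<?php
// Sketch only: read an RSS feed and check the keyword against item titles.
// The feed URL below is a placeholder; assumes a standard RSS 2.0 feed.
$keyword = 'a';
$feed = simplexml_load_file('http://example.com/rss');

$i = 0;
foreach ($feed->channel->item as $item) {
    // in RSS, each item's <title> is the headline
    if (stripos((string) $item->title, $keyword) !== false) {
        echo (string) $item->link . '<br>';
        $i++;
    }
}
echo "<h1>$i headlines were found</h1>";
?>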

2 Answers


How slow are we talking?

1-2 seconds would be pretty good.

If you're using this for a website, I'd advise splitting the crawling and the display into two separate scripts, and caching the results of each crawl.

You could:

  • have a crawl.php file that runs periodically to update your links.
  • then have a webpage.php that reads the results of the last crawl and displays it however you need for your website.

This way:

  • Every time you refresh your webpage, it doesn't re-request info from the news site.
  • It matters less if the news site takes a little while to respond.

Decouple crawling/display

You will want to decouple crawling and display completely. Have a "crawler.php" that runs over all the news sites one at a time, saving the raw links to a file. This can run every 5-10 minutes to keep the news updated; be warned that at intervals under a minute some news sites may get annoyed!

crawler.php

<?php
// Run this file from cli every 5-10 minutes
// doesn't matter if it takes 20-30 seconds

require 'simple_html_dom.php';

$html_output = ""; // use this to build up html output

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    /* more sites go here, like this */
    // array('URL', 'KEY')
);

// loop over each site
foreach ($sites as $site){
   $url = $site[0];
   $key = $site[1];
   // fetch site
   $syds = file_get_html($url);

   // loop over each link
   foreach($syds->find($key) as $element) {
     // add link to $html_output
     $html_output .= $element->href . "<br>\n";
   }
}
// save $html_output to a local file
file_put_contents("links.php", $html_output);
?>

display.php

<!-- other display stuff here -->
<?php
// include the file of links
include("links.php");
?>
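
If the keyword check from the question should be kept, one option (a sketch only, reusing the Simple HTML DOM calls and the links.php file name from above) is to filter inside the crawler loop before appending each link, so display.php stays a plain include:

<?php
// Variant of crawler.php that keeps the keyword check from the question.
require 'simple_html_dom.php';

$keyword = 'a';        // keyword from the original question
$html_output = '';

$sites = array(
    array('http://www.sydsvenskan.se/nyhetsdygnet', 'a.newsday__title')
    // array('URL', 'KEY')
);

foreach ($sites as $site) {
    $syds = file_get_html($site[0]);
    foreach ($syds->find($site[1]) as $element) {
        // only keep links whose visible headline text contains the keyword
        if (stripos($element->plaintext, $keyword) !== false) {
            $html_output .= $element->href . "<br>\n";
        }
    }
}

file_put_contents("links.php", $html_output);
?>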

Still want faster?

If you want it any faster, I'd suggest looking into Node.js; it's much faster at TCP connections and HTML parsing.


15 Comments

The more sources I add, the longer it takes, and it becomes very annoying. I'll look into splitting it up and using a crawler. Thanks!
I've added some more info on how to decouple the crawling and display into 2 scripts.
Thanks for your code. But could you edit it so it uses the code I've written earlier? I'm not sure which part goes where. :)
Do you want a $syds_key per site, or just 'a.newsday__title' for all?
Yes, one key per site, please.

The bottlenecks are:

  • Blocking I/O - you can switch to an asynchronous HTTP client like Guzzle and fetch all the sites concurrently (a sketch follows below)

  • Parsing - you can switch to a different parser for better parsing speed
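
A minimal sketch of the concurrent-fetch idea, assuming a recent Guzzle installed via Composer (its promise API is used here); the URL list mirrors the site list from the earlier answer, and the parsing step is left to whichever parser you choose:

<?php
// Sketch only: fetch several news sites concurrently with Guzzle.
require 'vendor/autoload.php'; // assumes Guzzle was installed via Composer

use GuzzleHttp\Client;
use GuzzleHttp\Promise\Utils;

$urls = array(
    'syds' => 'http://www.sydsvenskan.se/nyhetsdygnet'
    // more sites go here
);

$client = new Client(array('timeout' => 10));

// start all requests without waiting for each one to finish
$promises = array();
foreach ($urls as $name => $url) {
    $promises[$name] = $client->getAsync($url);
}

// settle() waits for all of them and doesn't throw if one site is down
$results = Utils::settle($promises)->wait();

foreach ($results as $name => $result) {
    if ($result['state'] === 'fulfilled') {
        $html = (string) $result['value']->getBody();
        // hand $html to your parser of choice here
    }
}
?>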
