
I have been asked to write a script that parses all hrefs in a page, then visits each href and checks whether the page is up and running (using the HTTP status code from cURL calls). I have something like this:

<?php foreach ($links_array as $link_array): // $links_array holds the <a> tags
                $link_array = str_replace("'", "\"", $link_array); // normalize href='link' to href="link"
                $href = get_attribute($link_array, "href");
                $resolved_address = resolve_address($href, $page_base);
                $download_href = http_get($resolved_address, $ref);
                $url = $download_href['STATUS']['url'];
                $http_code = $download_href['STATUS']['http_code'];
                $total_time = $download_href['STATUS']['total_time'];
                $message = $status_code_array[$download_href['STATUS']['http_code']];
                // $status_code_array maps an HTTP status code to its
                // human-readable message
                ?>
                <tr>
                <td><?php echo $url ?></td>
                <td><?php echo $http_code ?></td>
                <td><?php echo $message ?></td>
                <td><?php echo $total_time ?></td>
                </tr>
           <?php endforeach;?>

The script works for pages with a small number of hrefs, but if a page has many hrefs the script times out. I have tried increasing max_execution_time in php.ini, but this doesn't seem like an elegant solution. My questions are: 1) How does production software handle cases like this, where a task takes a long time to execute? 2) Can I continue making cURL calls by catching the fatal "Maximum execution time of 60 seconds exceeded" error? 3) Also, it would be better if I could make a cURL call for the first href, check the code, print it using HTML, then make the next cURL call for the second href, check the code, print it, and so on. How can I do this?

Please bear with my ignorance; I am three months into web programming.

  • Why do you need the str_replace? The comment if href='link' instead of href="link" is incorrect; both quote types are valid. Commented Sep 30, 2018 at 13:51
  • The script, when run from the command line, should have no timeout. You could set the timeout to 0 if run from a browser. I wouldn't expect a crawler to execute fast, though... and I also wouldn't rely on output from a crawler. Commented Sep 30, 2018 at 13:54
  • I know both are valid, but my get_attribute function only parses " and not '. I need to look into that bit later. I will try the command line. Thanks. Commented Sep 30, 2018 at 16:15
  • It works faster in the command line. It has a timeout, though, which I set to 0. Also, I am curious why you would not rely on output from a crawler? Commented Sep 30, 2018 at 16:49
  • Oh, do you have to write your own parser? DOMDocument already has that functionality and will work with both. Commented Sep 30, 2018 at 17:08

3 Answers


You can set the max_execution_time in the php.ini file. Make sure you're editing the right one, as there might be two files (one for FPM, one for CLI).

You can see which ini files are in use with:

php --ini

You can also set the execution time inside your script:

ini_set('max_execution_time', 300);

Alternatively, you can set the time in your php command as well.

php -dmax_execution_time=300 script.php

To answer your other questions:

How does production software handle cases like this?

One way (in PHP) would be using workers (RabbitMQ/AMQP). This means you have one script that 'sends' messages into a queue, and n workers that pull messages from that queue until it is empty.

https://github.com/php-amqplib/php-amqplib
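To make the producer/worker idea concrete without a broker, here is a minimal in-process sketch. The URLs and the checkUrl() stub are placeholders; with RabbitMQ, the queue would live on the broker and the workers would be separate long-running processes driven by php-amqplib.

```php
<?php
// Producer: push every URL to check into a queue.
$queue = new SplQueue();
foreach (['http://example.com/a', 'http://example.com/b'] as $url) {
    $queue->enqueue($url);
}

// checkUrl() is a stand-in for the real cURL status check.
function checkUrl(string $url): int
{
    return 200; // pretend every link is fine
}

// Worker loop: pull messages until the queue is empty.
$results = [];
while (!$queue->isEmpty()) {
    $url = $queue->dequeue();
    $results[$url] = checkUrl($url);
}

print_r($results);
```

Because each queue message is independent, you can run as many workers as you like in parallel; the queue, not the PHP execution limit, bounds how much work gets done.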

Can I continue making cURL calls by catching the fatal "Maximum execution time of 60 seconds exceeded" error?

Not directly: the execution-time limit raises a fatal error, not an exception, so it cannot be caught. What you can do is give each cURL request its own timeout and check for cURL-level errors:

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_TIMEOUT, 10); // per-request timeout
curl_exec($ch);
if (curl_errno($ch)) {
    echo 'Request Error: ' . curl_error($ch);
}


For links pointing to broken servers, the cURL timeout could take a long time. With 10 broken links, the script could take a few minutes to finish.

I would suggest storing links_array in a database, or in an XML or JSON file, with a check queue, and creating a script that checks all links in the queue and stores the http_code response and other data in that database or file.

Then you need an AJAX script that queries the server every X seconds, fetches the newly checked links from the file or database, and puts that data on the HTML page.

You could use a cron job or RabbitMQ to start the link-checking script.
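A minimal sketch of the queue-file idea, assuming a JSON file as the shared store (the filename and checkUrl() stub are hypothetical). A real setup would use a database and run the checker separately from the AJAX endpoint:

```php
<?php
// Seed the queue file with unchecked links (http_code = null).
$queueFile = sys_get_temp_dir() . '/queue.json';
file_put_contents($queueFile, json_encode([
    ['url' => 'http://example.com/a', 'http_code' => null],
    ['url' => 'http://example.com/b', 'http_code' => null],
]));

// checkUrl() stands in for the real cURL status check.
function checkUrl(string $url): int { return 200; }

// Checker script: process every unchecked link, store the result back.
$links = json_decode(file_get_contents($queueFile), true);
foreach ($links as &$link) {
    if ($link['http_code'] === null) {
        $link['http_code'] = checkUrl($link['url']);
    }
}
unset($link);
file_put_contents($queueFile, json_encode($links));

// The AJAX endpoint would simply re-read the file and return the
// checked entries as JSON for the page to render.
$checked = json_decode(file_get_contents($queueFile), true);
print_r($checked);
```

The key point is that the long-running check and the page rendering are decoupled: the browser only ever reads already-stored results, so no single HTTP request has to outlive the execution limit.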



Use CURLOPT_TIMEOUT. Your updated code:

ini_set('max_execution_time', 0);

foreach($links_array as $link){
    $start       = microtime(true);
    $link        = get_attribute( str_replace( '\'', '"', $link ), 'href' );
    $url         = resolve_address( $link, $page_base );
    $http_code   = getHttpCode( $url );
    $total_time  = microtime(true) - $start;
    if($http_code != 0){
        echo '<tr>
                <td>' . $url . '</td>
                <td>' . $http_code . '</td>
                <td>' . $total_time . ' s. </td>
            </tr>';
    }
}

function getHttpCode( $url )
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, true);
    curl_setopt($ch, CURLOPT_NOBODY, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $output = curl_exec($ch);
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpcode;
}
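If the links are checked one at a time, total time is the sum of all timeouts. An alternative sketch using curl_multi checks several links in parallel, so one slow or dead server does not block the rest (the URLs here are placeholders; nothing is assumed to be listening on them):

```php
<?php
function getHttpCodes(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD-style request
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // per-request timeout
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until every handle has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $codes = [];
    foreach ($handles as $url => $ch) {
        $codes[$url] = curl_getinfo($ch, CURLINFO_HTTP_CODE); // 0 on failure
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $codes;
}
```

With this approach, ten broken links cost roughly one timeout instead of ten, since all requests wait concurrently.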
