
I want to parse a lot of URLs to only get their status codes.

So what I did is:

$handle = curl_init($url->loc);
curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
curl_setopt($handle, CURLOPT_HEADER, true);  // we want headers
curl_setopt($handle, CURLOPT_NOBODY, true);
curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($handle);
$httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);
curl_close($handle);

But as soon as the NOBODY option is set to true, the returned status codes are incorrect (google.com returns 302, other sites return 303).

Setting this option to false is not possible because of the performance loss.

Any ideas?

  • Do a custom request and issue only a HEAD. Doing a full-blown GET will also transfer the body; HEAD gives you ONLY the headers. Commented Dec 1, 2014 at 19:59
  • @MarcB could you show me your proposed code? Commented Dec 1, 2014 at 20:00
  • curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'HEAD') Commented Dec 1, 2014 at 20:03

3 Answers


The default HTTP request method for curl is GET. If you want only the response headers, you can use the HTTP method HEAD.

curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'HEAD');

As @Dai's answer points out, CURLOPT_NOBODY already makes cURL use the HEAD method, so the option above will not help.

Another option would be to use fsockopen to open a connection and write the request headers with fwrite. Read the response using fgets until the first occurrence of \r\n\r\n to get the complete header. Since you need only the status code, you just have to read the first dozen or so characters of the status line.

<?php
// Open a raw TCP connection and speak HTTP/1.1 by hand.
$fp = fsockopen("www.google.com", 80, $errno, $errstr, 30);
if ($fp) {
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: www.google.com\r\n";
    $out .= "Accept-Encoding: gzip, deflate, sdch\r\n";
    $out .= "Accept-Language: en-GB,en-US;q=0.8,en;q=0.6\r\n";
    $out .= "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36\r\n";
    $out .= "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    // fgets($fp, 13) reads at most 12 bytes, e.g. "HTTP/1.1 302".
    $tmp = explode(' ', fgets($fp, 13));
    echo $tmp[1]; // the status code
    fclose($fp);
}

7 Comments

So I'd just have to add this option or remove one of the already set options?
Yes, add it before curl_exec. The HEAD method does not contain a response body, so you can remove CURLOPT_NOBODY.
As per @Dai's answer, it is not going to work. Sorry.
unfortunately, that doesn't execute as fast as with CURLOPT_NOBODY (in fact, it seems to be just as slow as without CURLOPT_NOBODY and without a custom request) and the returned status code is still wrong...
Try the above code. It gives 302 for www.google.com, and is very fast too.

cURL's NOBODY option makes it use the HEAD HTTP verb. I'd wager the majority of non-static web applications in the wild don't handle this verb correctly, hence the differing results you're seeing. I suggest making a normal GET request and discarding the response.
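Discarding the response can be done without ever buffering it, via a write callback. A rough sketch (statusCode() is an illustrative helper name, not from the original post; options mirror the question's setup):

```php
<?php
// Sketch: do a normal GET, but throw the body away chunk by chunk via
// CURLOPT_WRITEFUNCTION, so nothing is buffered in memory and the status
// code comes from a request the server treats as ordinary traffic.
function statusCode(string $url): int
{
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    // Returning the chunk length tells cURL the data was handled.
    curl_setopt($handle, CURLOPT_WRITEFUNCTION, function ($h, $chunk) {
        return strlen($chunk);
    });
    curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return $code;
}

// echo statusCode('http://www.example.com');
```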

5 Comments

the problem is performance - I need to do thousands of such requests in a row. Is there anything else I could do to speed it up?
cheap multi-threading, divide the number of urls to check by the number of cores you have, run one script per core doing a subset of the url's
you're right... it doesn't help that PHP's flush() doesn't work because the server is running nginx; otherwise I could at least write out some status...
strangely I just found out, all status codes >= 400 are correct, and because that's all I need, it's ok for me...
OK, this code is really insanely fast, but my URLs don't seem to work - e.g. www.raffiniert.biz/kunden/coop_ch, which should return a 403, just returns nothing, even if the URL has a trailing slash?
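Building on the one-script-per-core idea from the comments, PHP's curl_multi API can also run many of these checks concurrently in a single process. A sketch under the assumption that only the status code matters (statusCodes() is an illustrative name):

```php
<?php
// Sketch: fetch status codes for many URLs concurrently with curl_multi,
// instead of issuing thousands of sequential requests.
function statusCodes(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $h = curl_init($url);
        curl_setopt($h, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($h, CURLOPT_SSL_VERIFYPEER, false);
        curl_multi_add_handle($multi, $h);
        $handles[$url] = $h;
    }
    // Drive all transfers until none are still running.
    do {
        curl_multi_exec($multi, $running);
        curl_multi_select($multi);
    } while ($running > 0);
    $codes = [];
    foreach ($handles as $url => $h) {
        $codes[$url] = curl_getinfo($h, CURLINFO_HTTP_CODE);
        curl_multi_remove_handle($multi, $h);
        curl_close($h);
    }
    curl_multi_close($multi);
    return $codes;
}
```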

I suggest get_headers() instead:

<?php
$url = 'http://www.example.com';

print_r(get_headers($url));

print_r(get_headers($url, 1));
?>
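get_headers() returns the status line as element 0 (e.g. "HTTP/1.1 200 OK"), so extracting the numeric code is one split away. A small sketch (parseStatusLine() is a helper name of my own):

```php
<?php
// Sketch: pull the numeric status out of an HTTP status line,
// e.g. "HTTP/1.1 200 OK" -> 200.
function parseStatusLine(string $line): int
{
    $parts = explode(' ', $line, 3);
    return (int) ($parts[1] ?? 0);
}

// $headers = get_headers('http://www.example.com');
// echo parseStatusLine($headers[0]);
```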

3 Comments

that's not as fast as cURL according to my tests?
ok, glad you benchmarked it. will leave for future reference.
sorry, I wanted to say: It's faster than cURL with body, but slower than cURL without body (according to my tests). YMMV
