
I tried to use file_exists(URL/robots.txt) to see whether the file exists on randomly chosen websites, and I get a false response.

How do I check whether the robots.txt file exists?

I don't want to start the download before I check.

Will fopen() do the trick? The manual says it "Returns a file pointer resource on success, or FALSE on error."

So I guess I can write something like:

$f=@fopen($url,"r"); 
if($f) ...

My output:

http://www1.macys.com/robots.txt maybe it's not there
http://www.intend.ro/robots.txt maybe it's not there
http://www.emag.ro/robots.txt maybe it's not there
http://www1.bloomingdales.com/robots.txt maybe it's not there

My code:

try {
    if (file_exists($file)) {
        echo 'exists' . PHP_EOL;
        $curl_tool = new CurlTool();
        $content = $curl_tool->fetchContent($file);
        // if the file exists on local disk, delete it
        if (file_exists(CRAWLER_FILES . 'robots_' . $website_id . '.txt')) {
            unlink(CRAWLER_FILES . 'robots_' . $website_id . '.txt');
        }
        echo CRAWLER_FILES . 'robots_' . $website_id . '.txt', $content . PHP_EOL;
        file_put_contents(CRAWLER_FILES . 'robots_' . $website_id . '.txt', $content);
    } else {
        echo 'maybe it\'s not there' . PHP_EOL;
    }
} catch (Exception $e) {
    echo 'EXCEPTION ' . $e . PHP_EOL;
}
  • I think you're going to have to check the response header and see if it contains a 404 Not Found error. Commented Aug 15, 2012 at 8:33
  • Do you get any errors? Please ensure you have error reporting and display errors turned on: error_reporting(E_ALL); ini_set('display_errors', 1); Commented Aug 15, 2012 at 8:33
  • fopen with the r flag will do the trick. Commented Aug 15, 2012 at 8:34
  • @Edwin Drood I would love to do that, but I don't know how. Commented Aug 15, 2012 at 8:35
  • @IonutFlaviusPogacian If all you want to do is check whether the file exists, you should send a HEAD request and examine the response code, which you can indeed do with cURL. Commented Aug 15, 2012 at 8:38

3 Answers

file_exists() cannot be used on resources on other websites; it is intended for the local filesystem. Have a look here on how to perform the check properly.

As others have mentioned in the comments, and as the link says, it's (probably) easiest to use the get_headers() function to do this:

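// get_headers() returns an array of header lines (or false on failure);
// the status line of the first response is element 0.
$headers = @get_headers($url);
if ($headers === false || strpos($headers[0], '404') !== false) {
    // ... your code for a missing robots.txt ...
} else {
    // ... you get the idea ...
}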
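
get_headers() follows redirects by default, so the returned array can contain more than one status line. Here is a minimal sketch of one way to deal with that, assuming the last status line is the one that counts. statusCodesFromHeaders() and robotsTxtExists() are hypothetical helper names for illustration, not part of PHP:

```php
<?php
// Hypothetical helper: collect every HTTP status code found in the array
// that get_headers() returns (one status line per response in a redirect chain).
function statusCodesFromHeaders(array $headers): array
{
    $codes = [];
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 200 OK" or "HTTP/2 200".
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $codes[] = (int) $m[1];
        }
    }
    return $codes;
}

// Hypothetical helper: true when the final response in the chain is a 200.
function robotsTxtExists(string $url): bool
{
    $headers = @get_headers($url);
    if ($headers === false) {
        return false; // DNS failure, timeout, refused connection, ...
    }
    $codes = statusCodesFromHeaders($headers);
    return $codes !== [] && end($codes) === 200;
}
```

You would call it as robotsTxtExists('http://www.emag.ro/robots.txt'). Note that get_headers() sends a GET request unless you change the default stream context, so the server may still start sending the body.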

3 Comments

@DaveRandom: That I did not know :) Thanks for pointing it out.
Not yet I'm afraid, look at ),1)
@DaveRandom: Attempt #2... Coffee break methinks :)

Just to second what other people have said:

It's best to use cURL in PHP to find out whether http://example.com/robots.txt returns a 404 status code. If it does, the file does not exist. If it returns a 200, it exists.

Be wary of custom 404 pages though; I've never looked into what they return.
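
A rough sketch of that cURL check, using a HEAD request so the body is never downloaded. The function names are made up for illustration, and the Content-Type test is only my assumption about how to spot a custom error page that comes back with a 200; it is a heuristic, not a guarantee:

```php
<?php
// Hypothetical helper: decide from the status code and Content-Type whether
// the response plausibly is a real robots.txt.
function looksLikeRobotsTxt(int $status, ?string $contentType): bool
{
    if ($status !== 200) {
        return false;
    }
    if ($contentType === null) {
        return true; // no Content-Type sent: fall back on the status code alone
    }
    // Assumption: a 200 response served as HTML is an error page in disguise.
    return stripos($contentType, 'text/html') !== 0;
}

// Hypothetical helper: HEAD request via cURL, following redirects.
function robotsTxtExistsViaCurl(string $url): bool
{
    $ch = curl_init($url);
    if ($ch === false) {
        return false;
    }
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // send HEAD instead of GET
        CURLOPT_FOLLOWLOCATION => true,  // report the status of the final URL
        CURLOPT_RETURNTRANSFER => true,  // do not echo anything
        CURLOPT_TIMEOUT        => 10,    // seconds; arbitrary choice
    ]);
    $ok     = curl_exec($ch) !== false;
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    $type   = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    return $ok && looksLikeRobotsTxt($status, is_string($type) ? $type : null);
}
```

Some servers reject HEAD requests, so if this returns false for a site you know has a robots.txt, a normal GET may be needed as a fallback.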

2 Comments

@EdwinDrood Not necessarily. There are many, many site admins who understand worryingly little about HTTP and may return an error page with a 200 response code, I've also seen sites that 3xx redirect you to error pages. Both of these are massive protocol violations, but some people seem to think that RFCs don't apply to them.
A link would be useful for this, using cURL.

The http:// wrapper does not support stat() functionality, which file_exists() needs; you will need to check the HTTP response code from e.g. cURL.

From the PHP manual for file_exists(): "As of PHP 5.0.0, this function can also be used with some URL wrappers. Refer to Supported Protocols and Wrappers to determine which wrappers support stat() family of functionality."
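
To see the difference for yourself, here is a small sketch that runs file_exists() against a local temp file and against an http:// URL. demoFileExists() is a hypothetical name and example.com is just a placeholder host:

```php
<?php
// Hypothetical demo: compare file_exists() across wrappers.
function demoFileExists(): array
{
    // tempnam() creates a real, empty file on the local disk.
    $local = tempnam(sys_get_temp_dir(), 'robots_');

    $result = [
        'plain path'  => file_exists($local),
        'file:// URL' => file_exists('file://' . $local),
        // The http:// wrapper has no stat() support, so this is false
        // whether or not the remote file is really there.
        'http:// URL' => file_exists('http://example.com/robots.txt'),
    ];

    unlink($local); // clean up the temp file
    return $result;
}

var_dump(demoFileExists());
```

That false for the http:// entry is exactly what the question ran into, which is why the HTTP status code has to be checked instead.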

2 Comments

Is it really that stupid that it won't treat a 200 as "this file exists"?
@DaveRandom: It's PHP. I've stopped wondering if it's "really that stupid" a LONG time ago.
