I am trying to make a simple web crawler with PHP and I am having issues getting the HTML source of a given URL. I am currently using cURL to get the source.

My code:

    <?php
    $url = "http://www.nytimes.com/";

    function url_get_contents($Url) {
        if (!function_exists('curl_init')) {
            die('CURL is not installed!');
        }
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $Url);
        // Return the response as a string instead of printing it directly.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $output = curl_exec($ch);
        // curl_exec() returns false on failure; report the reason.
        if ($output === false) { die(curl_error($ch)); }
        curl_close($ch);
        return $output;
    }

    echo url_get_contents($url);
    ?>

Right now nothing gets echoed and there aren't any errors, so it is a bit of a mystery. Any suggestions or fixes would be appreciated.

Edit: I added

if ($output === false) { die(curl_error($ch)); }

to the middle of the function and it ended up giving me an error (finally!):

Could not resolve host: www.nytimes.com

I still do not really know what the problem is. Any ideas?

Thanks

5 Comments

  • you never bothered checking if curl succeeded. if ($output === false) { die(curl_error($ch)); } Commented Jun 25, 2015 at 21:21
  • stackoverflow.com/questions/6516902/… should help. Commented Jun 25, 2015 at 21:22
  • $Url != $url also - variables are case sensitive Commented Jun 25, 2015 at 21:23
  • Probably nytimes.com has something to prevent web crawling. Have you tried with a different url? Commented Jun 25, 2015 at 22:58
  • @AlvaroFlañoLarrondo False. curl -i http://www.nytimes.com/ returns an HTTP/1.1 200 response. Commented Jun 25, 2015 at 23:47

2 Answers

It turns out that it was not a cURL problem.

My host server (an Ubuntu VM) was using a "host-only" network adapter, which blocked access to every IP address and domain outside its host machine, so cURL could not connect to any external URL.

Once I changed it to a "bridged" network adapter, the VM had access to the outside world.
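
For anyone hitting the same "Could not resolve host" message, a quick way to tell a network or DNS problem apart from a cURL problem is to test name resolution on its own. The following is only a minimal sketch: `host_resolves` is a hypothetical helper name that does not appear in the question, and the check relies on `gethostbyname()` returning the unmodified host name when an IPv4 lookup fails.

```php
<?php
// Hypothetical helper (not from the original post): reports whether a
// host name can be resolved to an IPv4 address from this machine.
function host_resolves(string $host): bool
{
    // An IP literal needs no DNS lookup, so treat it as resolvable.
    if (filter_var($host, FILTER_VALIDATE_IP) !== false) {
        return true;
    }
    // gethostbyname() returns the unmodified host name when the lookup fails.
    return gethostbyname($host) !== $host;
}

// Extract the host part of the URL used in the question and test it.
$host = parse_url('http://www.nytimes.com/', PHP_URL_HOST);
echo $host, host_resolves($host) ? " resolves\n" : " does not resolve\n";
```

If this reports that the host does not resolve, the cause is likely outside PHP (for example the VM's network adapter mode or its DNS settings) rather than in the cURL code.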

Hope this helps.

Variable case mismatch ($url vs. $Url). Change:

function url_get_contents($Url) {

to

function url_get_contents($url) {
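
As a small illustration of why a case mismatch matters when it occurs inside a single function body, here is a sketch; `read_lowercase_name` is a hypothetical function written for this example, not code from the question. PHP variable names are case-sensitive, so `$Url` and `$url` are two different variables.

```php
<?php
// Hypothetical demo function, not part of the original question.
function read_lowercase_name($Url)
{
    // Only $Url exists in this scope; $url was never assigned here.
    return isset($url) ? $url : 'undefined';
}

echo read_lowercase_name('http://www.nytimes.com/'), "\n"; // prints "undefined"
```

A parameter named `$Url` and a global named `$url` can coexist without conflict, as long as each scope uses its own name consistently.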

2 Comments

The two variables are used in different contexts, inside and outside the function. Plus, the edited question shows that the URL is read correctly.
@AlvaroFlañoLarrondo This answer was posted prior to the question edit at a time where the variable names did not align within the function. I was keenly aware that there are 2 variables in two different contexts.
