24

I'm having a problem with PHP's cURL returning an empty string for some URLs. I'm trying to parse the Open Graph (OG) metadata of different webpages, and it works with every website I've tried except NYTimes. Here is my code so far.

print_r(get_og_metadata('http://somewebsite.com'));


public function get_data($url)
{
    $ch = curl_init();
    $timeout = 5;
    // the url to fetch
    curl_setopt($ch, CURLOPT_URL, $url);
    // return result as a string rather than direct output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // maximum number of seconds to wait while connecting
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}

public function get_og_metadata($url)
{
    libxml_use_internal_errors(TRUE);
    $data = $this->_get_data($url);
    $doc = new DOMDocument();
    $doc->loadHTML($data);

    $xpath = new DOMXPath($doc);
    $query = '//*/meta[starts-with(@property, \'og:\')]';

    $metadatas = $xpath->query($query);
    $result = array();
    foreach($metadatas as $metadata)
    {
        $property = $metadata->getAttribute('property');
        $content = $metadata->getAttribute('content');
        $result[$property] = $content;
    }

    return $result;
}
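
To rule out the parsing step, the XPath part can be exercised on its own with a literal HTML string. The following is a minimal sketch, not the asker's code: `og_metadata_from_html` is a hypothetical standalone variant of `get_og_metadata` that takes markup instead of a URL, and it assumes that an empty or `false` result from `curl_exec` should simply yield an empty array rather than be handed to `loadHTML`.

```php
<?php
// Hypothetical standalone variant of get_og_metadata(): takes HTML, not a URL.
function og_metadata_from_html($html): array
{
    // curl_exec() returns false on failure and may return '' for an empty
    // body; neither contains markup worth parsing.
    if (!is_string($html) || trim($html) === '') {
        return [];
    }

    // Collect libxml warnings internally instead of emitting PHP warnings,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Same idea as the query in the question: every <meta> element whose
    // property attribute starts with "og:".
    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        $result[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }

    return $result;
}

// Sample markup made up for illustration.
$sample = '<html><head><meta property="og:title" content="Example"></head><body></body></html>';
print_r(og_metadata_from_html($sample));
```

If this returns the expected tags for markup saved from the problem page, the empty result is coming from the fetch step rather than from the XPath query.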
  • The function is called get_data, but you call _get_data? Commented Feb 4, 2013 at 3:25
  • Whoops! That was just a mistake when copying the code here. Good catch, though! Commented Feb 4, 2013 at 4:00

4 Answers

37

These five lines did the trick for me.

   curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');
   curl_setopt($ch, CURLOPT_AUTOREFERER, true); 
   curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
   curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
   curl_setopt($ch, CURLOPT_VERBOSE, 1);
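
For context, here is one way these options could be folded into the fetch function from the question. This is only a sketch: `browser_like_options` and `fetch_page` are hypothetical names, and `curl_setopt_array` is used in place of repeated `curl_setopt` calls purely for brevity.

```php
<?php
// Hypothetical helper: gathers the options from this answer, plus the
// URL and connect timeout from the question, into one array.
function browser_like_options(string $url, int $timeout = 5): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body as a string
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds to wait while connecting
        CURLOPT_FOLLOWLOCATION => true,     // follow HTTP redirects
        CURLOPT_AUTOREFERER    => true,     // set Referer automatically when redirected
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17',
        CURLOPT_VERBOSE        => true,     // write debug information to STDERR
    ];
}

// Hypothetical replacement for get_data(): returns the body as a string,
// or false if the transfer failed.
function fetch_page(string $url)
{
    $ch = curl_init();
    curl_setopt_array($ch, browser_like_options($url));

    // The handle is released when $ch goes out of scope.
    return curl_exec($ch);
}
```

Keeping the options in a separate function makes it easy to switch individual ones off and see which of them actually changes the result for a given site.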
2 Comments

CURLOPT_FOLLOWLOCATION and CURLOPT_USERAGENT saved me. Thanks.
CURLOPT_FOLLOWLOCATION did the trick.
19

My guess is that a site like the New York Times has protection against automated requests. Most likely this is based on the user agent, which you can fake like so:

curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17');

This is the most common user agent, by the way.

1 Comment

Setting the user agent didn't work, but setting CURLOPT_AUTOREFERER to TRUE actually did. Your answer did help me rethink what could have been causing the problem, though! curl_setopt($ch, CURLOPT_AUTOREFERER, true);
12

(The other answer is also mine.)

This is what did it for me. It was looking for SSL verification, which I happened not to need in this specific case.

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
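
cURL does not throw exceptions, so a try/catch around `curl_exec` sees nothing. On failure `curl_exec` returns `false`, and the reason is available through `curl_errno` and `curl_error`. The sketch below shows one way to surface it. `fetch_with_diagnostics` and the shape of its return array are assumptions made for illustration, not part of any library.

```php
<?php
// Hypothetical helper: fetch a URL and report why the transfer failed,
// instead of silently returning an empty value.
function fetch_with_diagnostics(string $url, int $timeout = 5): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,     // return the body as a string
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds to wait while connecting
    ]);

    $body = curl_exec($ch);

    if ($body === false) {
        // The transfer itself failed (DNS, connection, TLS, timeout, ...).
        return [
            'ok'     => false,
            'errno'  => curl_errno($ch),
            'error'  => curl_error($ch),
            'status' => 0,
            'body'   => '',
        ];
    }

    // The transfer succeeded; the body can still be empty, for example
    // when the server answers with a redirect that was not followed.
    return [
        'ok'     => true,
        'errno'  => 0,
        'error'  => '',
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'body'   => $body,
    ];
}
```

Calling `fetch_with_diagnostics($url)` and printing the `error` entry before touching the SSL options should show whether certificate verification is really what fails. A blank body with `ok` set to `true` points at the HTTP status instead.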

3 Comments

This helped me in going to an SSL API from a dev (non-SSL) environment. Thanks!
Thanks, this helped me as well. I tried a try/catch to catch any errors, but the cURL response was just blank, with no errors. As soon as I added this, I got a response from the HTTPS server.
This one did it for me!