I am checking a URL and want to return "valid" if its status code is 200 and "invalid" if it is 404.

The URLs are links that redirect to another page (URL), and I need to check that final page's status code to decide whether the link is valid or invalid.

<?php

// From URL to get redirected URL
$url = 'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625';
  
// Initialize a CURL session.
$ch = curl_init();
  
// Grab URL and pass it to the variable.
curl_setopt($ch, CURLOPT_URL, $url);
  
// Catch output (do NOT print!)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  
// Return follow location true
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
$html = curl_exec($ch);
  
// Getinfo or redirected URL from effective URL
$redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
  
// Close handle
curl_close($ch);
echo "Original URL:   " . $url . "<br/> </br>";
echo "Redirected URL: " . $redirectedUrl . "<br/>";

 function is_url_valid($url) {
  $handle = curl_init($url);
  curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($handle, CURLOPT_NOBODY, true);
  curl_exec($handle);
 
  $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
  curl_close($handle);
 
  if ($httpCode == 200) {
    return 'valid link';
  }
  else {
    return 'invalid link';
  }
}

// 
echo "<br/>".is_url_valid($redirectedUrl)."<br/>";

As you can see, the above link has status 400, yet the script still reports "valid". Any thoughts or corrections to make this code work as expected? It seems the site has more than one redirect and the script checks only one of them, which is why it shows valid. Any thoughts on how to fix it?

Here are the links I am checking.

Issue, for example: if I check this link https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518 then in the browser it ends on a 404 page, but the script output is 200.

  • The above link has status code 302 and redirects to a new URL which has status code 200. I want to check the end URL (the last URL). Commented Jul 3, 2021 at 5:18
  • $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE)); - just to be safe, make sure it's an integer for your comparison. Commented Jul 3, 2021 at 5:20
  • Thanks for the comment and suggestion, though I am getting 404 as the status code in the output. Commented Jul 3, 2021 at 5:22
  • @devhs - I am not sure whether this is a proper solution, but I checked some of the above links and they serve a custom page for 404. As a quick solution, you can get the contents of the URL with "file_get_contents" and check the page title. Commented Jul 5, 2021 at 9:26
  • By "Refresh" header, I mean header("Refresh:5; url=page2.php"); in this case curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); doesn't follow the redirection. Other cases are the meta refresh http-equiv header and JavaScript redirects. Commented Jul 9, 2021 at 13:02

5 Answers


I use the get_headers() function for this. If a 2xx status is found in the returned array, the URL is OK.

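function urlExists($url){
  // Suppress warnings; get_headers() returns false on failure.
  $headers = @get_headers($url);
  if($headers === false) return false;
  // True if any status line in the response chain is 2xx.
  return (bool) preg_grep('~^HTTP/\d+\.\d+\s+2\d{2}~',$headers);
}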
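
Because get_headers() follows redirects by default, the returned array holds one status line per hop. The following is a minimal sketch, assuming only the last response in the chain matters, that picks out the final status code rather than matching any 2xx line. The helper name finalStatusCode and the sample data are hypothetical; the function works on an array shaped like get_headers() output, so it can be tried without a network call.

```php
<?php

// Hypothetical helper: return the status code of the last response in a
// header list shaped like the output of get_headers(), or null if none.
function finalStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 302 Found" or "HTTP/2 200".
        if (is_string($line) && preg_match('~^HTTP/\d+(?:\.\d+)?\s+(\d{3})~', $line, $m)) {
            $code = (int) $m[1]; // later hops overwrite earlier ones
        }
    }
    return $code;
}

// Made-up sample: one redirect followed by a 404.
$sample = [
    'HTTP/1.1 302 Found',
    'Location: https://example.org/next',
    'HTTP/1.1 404 Not Found',
    'Content-Type: text/html',
];

var_dump(finalStatusCode($sample)); // int(404)
```

Like the function above, this only sees HTTP-level redirects; it cannot see a redirect performed by JavaScript in the page.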

9 Comments

Thanks for the answer, but what if the main URL has redirections (multiple redirections)? Suppose this URL - shareasale.com/…
The function returns true for this URL. Is that OK?
No, it's not, as the page's status code is 404 (not found), so it should not return true.
I don't get an ad if JavaScript is deactivated in my browser. I think this forwarding is done via JavaScript. This problem cannot be solved with PHP alone.
I don't have a quick fix.

This is my take on this issue. Basically, the takeaways are:

  1. You don't need to make more than one request. CURLOPT_FOLLOWLOCATION does the whole job for you, and the HTTP response code you get in the end is the one from the final request when there are one or more redirections.
  2. Since you are using CURLOPT_NOBODY, the request uses the HEAD method and no body is returned. For that reason, CURLOPT_RETURNTRANSFER is unnecessary.
  3. I have taken the liberty of using my own coding style (no offence).
  4. Since I was running the code from a PhpStorm scratch file, I have added some PHP_EOL line breaks to format the output. Feel free to remove them.

...  

<?php

$linksToCheck = [
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=547531.5112&type=15&murl=https%3A%2F%2Fwww.peopletree.co.uk%2Fwomen%2Fdresses%2Fanna-checked-dress',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.2335&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fagnetha-black-floral-print-bamboo-dress-midnight-navy%2F%2392%3D1390%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.752&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fbernice-floral-tunic-dress%2F%2392%3D1273%26142%3D198',
    'https://click.linksynergy.com/link?id=GsILx6E5APM&offerid=330522.6863&type=15&murl=https%3A%2F%2Fwww.wearethought.com%2Fjosefa-smock-shift-dress-in-midnight-navy-hemp%2F%2392%3D1390%26142%3D208',
    'https://www.shareasale.com/m-pr.cfm?merchantID=16570&userID=1860618&productID=546729471',
    'https://www.shareasale.com/m-pr.cfm?merchantID=53661&userID=1860618&productID=680698793',
    'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518',
    'https://www.shareasale.com/m-pr.cfm?merchantID=83483&userID=1860618&productID=916465625',
];

function isValidUrl($url) {
    echo "Original URL:   " . $url . "<br/>\n";

    $handle = curl_init($url);

    // Follow any redirection.
    curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);

    // Use a HEAD request and do not return a body.
    curl_setopt($handle, CURLOPT_NOBODY, true);

    // Execute the request.
    curl_exec($handle);

    // Get the effective URL.
    $effectiveUrl = curl_getinfo($handle, CURLINFO_EFFECTIVE_URL);
    echo "Effective URL:   " . $effectiveUrl . "<br/><br/>";

    $httpResponseCode = (int) curl_getinfo($handle, CURLINFO_HTTP_CODE);

    // Close this request.
    curl_close($handle);

    if ($httpResponseCode == 200) {
        return '✅';
    }
    else {
        return '❌';
    }
}

foreach ($linksToCheck as $linkToCheck) {
    echo PHP_EOL . "Result: " . isValidUrl($linkToCheck) . PHP_EOL . PHP_EOL;
}

1 Comment

Haha, cool use of UTF-8! Unfortunately, OP wants to follow JavaScript redirects as well; see my answer below for info :(

Note: CURLOPT_NOBODY is used to check only the connection, without fetching the whole body.

$url = "Your URL";
$curl = curl_init($url);
// HEAD request: check the status without downloading the body.
curl_setopt($curl, CURLOPT_NOBODY, true);
$result = curl_exec($curl);
if ($result !== false) {
    $statusCode = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($statusCode == 404) {
        echo "URL does not exist";
    } else {
        echo "URL exists";
    }
} else {
    // The request itself failed (DNS error, timeout, ...).
    echo "URL does not exist";
}

Comments


The code below works well, but when I put the URLs in an array and test the same functionality, it does not give proper results. Any thoughts why? Also, if anybody would like to update the answer to make it dynamic (checking multiple URLs at once when an array of URLs is provided), please do.

  <?php
    
    // URL to check
    $url = 'https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518';
      
    $ch = curl_init(); // Initialize a CURL session.
    curl_setopt($ch, CURLOPT_URL, $url); // Grab URL and pass it to the variable.
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Catch output (do NOT print!)
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // Return follow location true
    $html = curl_exec($ch);
    $redirectedUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // Getinfo or redirected URL from effective URL
    curl_close($ch); // Close handle
    
    $get_final_url = get_final_url($redirectedUrl);
    if($get_final_url){
        echo is_url_valid($get_final_url);
    }else{
        echo $redirectedUrl ? is_url_valid($redirectedUrl) : is_url_valid($url);
    }
    
    function is_url_valid($url) {
      $handle = curl_init($url);
      curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($handle, CURLOPT_NOBODY, true);
      curl_exec($handle);
     
      $httpCode = intval(curl_getinfo($handle, CURLINFO_HTTP_CODE));
      curl_close($handle);
      echo $httpCode;
      if ($httpCode == 200) {
        return '<b> Valid link </b>';
      }
      else {
        return '<b> Invalid link </b>';
      }
    }
    
    function get_final_url($url) {
        $ch = curl_init();
        if (!$ch) {
            return false;
        }
        curl_setopt($ch, CURLOPT_URL,            $url);
        curl_setopt($ch, CURLOPT_HEADER,         1);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_TIMEOUT,        30);
        $ret = curl_exec($ch);

        // Read the request info before closing the handle.
        $info = empty($ret) ? [] : curl_getinfo($ch);
        curl_close($ch);

        if (empty($info['http_code'])) {
            return false;
        }
        // Look for a JavaScript redirect target such as replace('https:...').
        if (!preg_match('#(https:.*?)\'\)#', $ret, $match)) {
            return false;
        }
        return stripslashes($match[1]);
    }
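
To run the same check over an array of URLs, one option is to wrap the per-URL logic in a function and apply it to each entry. The following is a minimal sketch, not a drop-in fix: checkUrls and httpStatus are hypothetical names, and the status checker is passed in as a callable so the looping logic can be exercised without network access. httpStatus only follows HTTP Location redirects, not JavaScript ones.

```php
<?php

// Hypothetical cURL-based checker: final HTTP status after Location redirects,
// or 0 when the request itself fails.
function httpStatus(string $url): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD request, no body
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 10,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $ok = curl_exec($ch);
    // The handle is released when $ch goes out of scope.
    return $ok === false ? 0 : (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

// Hypothetical helper: apply a status checker to each URL and label the result.
function checkUrls(array $urls, callable $statusOf): array
{
    $results = [];
    foreach ($urls as $url) {
        $code = (int) $statusOf($url);
        $results[$url] = [
            'status' => $code,
            'valid'  => $code === 200,
        ];
    }
    return $results;
}

// Usage with the real checker (performs network requests):
// print_r(checkUrls(['https://example.org'], 'httpStatus'));
```

Each URL gets a fresh cURL handle here, so state from one request cannot leak into the next.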

1 Comment

Just an idea: requests from your script come in with a pattern the host detects, and it then counteracts your intentions. Or as you would perhaps word it: why does that host undermine my expectations? It's their server; you can only send requests and you have to live with the answer (response) ;)

See, the problem here is that you want to follow JavaScript redirects. The URL you're complaining about, https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518, does redirect to a URL responding HTTP 200 OK, and that page contains this JavaScript:

<script LANGUAGE="JavaScript1.2">
                window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
                </script>

So your browser, which understands JavaScript, follows the JavaScript redirect, and that redirect leads to a 404 page. Unfortunately, there is no good way to do this from PHP; your best bet would probably be a headless web browser, e.g. PhantomJS, Puppeteer, Selenium or something like that.

Still, you can kind of hack in a regex search for a JavaScript redirect and hope for the best, e.g.:

<?php
function is_url_valid(string $url):bool{
    if(0!==strncasecmp($url,"http",strlen("http"))){
        // file:///etc/passwd and stuff like that aren't considered valid urls right?
        return false;
    }
    $ch=curl_init();
    if(!curl_setopt_array($ch,array(
        CURLOPT_URL=>$url,
        CURLOPT_FOLLOWLOCATION=>1,
        CURLOPT_RETURNTRANSFER=>1
    ))){
        // best guess: the url is so malformed that even CURLOPT_URL didn't accept it.
        return false;
    }
    $resp= curl_exec($ch);
    if(false===$resp){
        return false;
    }
    if(curl_getinfo($ch,CURLINFO_RESPONSE_CODE) != 200){
        // only HTTP 200 OK is accepted
        return false;
    }
    // attempt to detect javascript redirects... sigh
    // window.location.replace('https:\/\/www.tenthousandvillages.com\/bicycle-statue?sscid=71k5_4yt9r ')
    $rex = '/location\.replace\s*\(\s*(?<redirect>(?:\'|\")[\s\S]*?(?:\'|\"))/';
    if(!preg_match($rex, $resp, $matches)){
        // no javascript redirects detected..
        return true;
    }else{
        // javascript redirect detected..
        $url = trim($matches["redirect"]);
        // javascript allows both ' and " for strings, but json only allows " for strings
        $url = str_replace("'",'"',$url);
        $url = json_decode($url, true,512,JSON_THROW_ON_ERROR); // we extracted it from javascript, need json decoding.. (well, strictly speaking, it needs javascript decoding, but json decoding is probably sufficient, and we only have a json decoder nearby)
        curl_close($ch);
        return is_url_valid($url);
    }
}
var_dump(

    is_url_valid('https://www.shareasale.com/m-pr.cfm?merchantID=66802&userID=1860618&productID=1186005518'),
    is_url_valid('http://example.org'),
    is_url_valid('http://example12k34jr43r5ehjegeesfmwefdc.org'),
    
);

But that's a dodgy, hacky solution, to put it mildly.
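
The recursive approach has no limit on how many JavaScript redirects it follows, so two pages that redirect to each other would keep it going until PHP gives up. The following is a minimal sketch of one way to bound it, assuming a hop limit and a visited-URL set are acceptable. extractJsRedirect, isUrlValidWithLimit and $curlFetch are hypothetical names; the fetcher is injected so the loop can be tried without network access.

```php
<?php

// Hypothetical helper: pull the target of a location.replace('...') call out
// of an HTML string, or return null when none is found.
function extractJsRedirect(string $html): ?string
{
    $rex = '/location\.replace\s*\(\s*([\'"])(.*?)\1/s';
    if (!preg_match($rex, $html, $m)) {
        return null;
    }
    // Undo the "\/" escaping seen in the page source and trim stray spaces.
    return trim(str_replace('\\/', '/', $m[2]));
}

// Hypothetical helper: follow JavaScript redirects at most $maxHops times.
// $fetch takes a URL and returns [statusCode, body].
function isUrlValidWithLimit(string $url, callable $fetch, int $maxHops = 5): bool
{
    $seen = [];
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        if (isset($seen[$url])) {
            return false; // redirect loop
        }
        $seen[$url] = true;

        [$status, $body] = $fetch($url);
        if ($status !== 200) {
            return false; // only HTTP 200 OK is accepted
        }
        $next = extractJsRedirect((string) $body);
        if ($next === null) {
            return true; // no further JavaScript redirect
        }
        $url = $next;
    }
    return false; // too many hops
}

// A possible cURL-based fetcher (performs a real GET request when called).
$curlFetch = function (string $url): array {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body = curl_exec($ch);
    $status = $body === false ? 0 : (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$status, (string) $body];
};

// Usage (network): var_dump(isUrlValidWithLimit('http://example.org', $curlFetch));
```

This is still only a regex guess at what the page's JavaScript does, so it shares the limitations of the hack above.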

6 Comments

Thanks for the answer. Let me check whether it works with multiple URLs at once, e.g. if I create an array of the URLs posted in the question and call the method "is_url_valid" in a loop.
@devhs shouldn't be a problem. By the way, I just noticed this approach has another significant weakness: it doesn't handle infinite redirects. E.g. if page1 redirects to page2, which redirects to page1, which redirects to page2..., this script will just follow the redirects forever, until PHP's max_execution_time is reached or the call stack is exhausted. (It's possible to fix, though.)
Thanks, I will check it. I have just tested it here - paiza.io/projects/N3m4E11HZAmq5uTb8gLjcg but it does not seem to be working.
@devhs that URL returns bool(true) for me if I make sure to change it from paiza.io to https://paiza.io. What do you get?
When you visit this link, it's a compiler where I have tested your code: paiza.io/projects/N3m4E11HZAmq5uTb8gLjcg