
I have a database of manufacturer URLs collected over the last few years, and I need to do some spring cleaning:

  1. Some URLs look like http://brandname.com/aboutus/, so I need to strip everything after the main domain, because many of those paths/subdirectories may have expired...

  2. I would love to be able to check whether those domains still exist or have been taken over by domain sharks...

I'm currently using PHP+MySQL.

  • Well, what is your question here? You need to take the URLs one by one, use parse_url() to pick out the parts you need (here the scheme and hostname), and then make a test request. I'd even say that you are not interested in the domains at all but in the hostnames, since a domain without a web service is most likely of no value to you... Commented Oct 12, 2016 at 16:14
  • Use a regular expression: stackoverflow.com/questions/569137/… Commented Oct 12, 2016 at 16:21
  • @arkascha thanks for pointing me to parse_url! Commented Oct 12, 2016 at 16:22
  • @Suraj Regular expressions are only a last resort, for when no better-suited function exists... Commented Oct 12, 2016 at 16:23

1 Answer

Below is a function that does what you ask, with references to Stack Overflow answers that give the details you need.

First:
Validate (and sanitise) the URL using PHP's standard filter_var function with the FILTER_VALIDATE_URL and FILTER_SANITIZE_URL filters. You may also need to ensure that the scheme is properly defined.
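As a minimal sketch of this first step, the helper below sanitises a raw value, prepends a scheme when one is missing, and validates the result. The name normalize_url() is hypothetical, and defaulting to http:// for scheme-less entries is an assumption about how the URLs were stored.

```php
<?php
// Hypothetical helper (not part of the original answer).
// Returns a validated URL string, or null if the input cannot be used.
function normalize_url(string $raw): ?string
{
    // Strip characters that are not allowed in a URL.
    $url = filter_var(trim($raw), FILTER_SANITIZE_URL);

    // Assumption: entries stored without a scheme are plain HTTP sites.
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'http://' . $url;
    }

    // FILTER_VALIDATE_URL returns the URL on success and false on failure.
    return filter_var($url, FILTER_VALIDATE_URL) === false ? null : $url;
}
```

Returning null rather than false keeps the return type a simple nullable string, so callers can skip unusable rows with a single null check.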

Second:
Run a PHP cURL request to get the HTTP headers of the full URL and then of the site URL. Source.

$url = 'http://www.example.com/folder/file.php';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HEADER, true);    // we want headers
curl_setopt($ch, CURLOPT_NOBODY, true);    // we don't need body
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_TIMEOUT,10);
$output = curl_exec($ch);
$httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

echo 'HTTP code: ' . $httpcode;
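The status code from the snippet above can be bucketed before deciding what to do with a row. This is only a sketch: classify_http_code() and its labels are hypothetical names. It relies on the fact that CURLINFO_HTTP_CODE is 0 when cURL received no HTTP response at all (for example a DNS failure or a timeout), which is the case most likely to indicate a dead domain.

```php
<?php
// Hypothetical helper (not part of the original answer).
// Maps a cURL HTTP status code to a coarse category.
function classify_http_code(int $code): string
{
    if ($code === 0) {
        // No HTTP response: DNS failure, refused connection, timeout...
        return 'unreachable';
    }
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code >= 300 && $code < 400) {
        return 'redirect';
    }
    if ($code >= 400) {
        return 'error';
    }
    return 'other'; // 1xx informational responses
}
```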

Third:
If $httpcode is 200, the link works. Otherwise we need to cut the link down to just the site and recheck whether the site (still) exists. You can do this using parse_url. Source.

So:
if ($httpcode == 200) {
    // works
}
elseif ($httpcode >= 400) {
    /*** errors 400+ ***/
    $siteUrlParts = parse_url($url);
    $siteUrl = $siteUrlParts['scheme']."://".$siteUrlParts['host'];
}
else {
    // some other status code; up to you how you want to handle this.
    // could be a 302 redirect, or 0 if cURL got no response at all...
}

Note that the scheme part is important, not just the host part.
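Reducing a URL to its site root can be isolated in a small function that needs no network access. The sketch below is one way to do it: site_root() is a hypothetical name, and keeping an explicit port is an assumption that goes beyond the snippet above.

```php
<?php
// Hypothetical helper (not part of the original answer).
// Reduces a full URL to scheme://host[:port], or returns null if the
// URL has no scheme or host to work with.
function site_root(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }

    // Scheme and host are case-insensitive, so lowercasing them is safe.
    $root = strtolower($parts['scheme']) . '://' . strtolower($parts['host']);

    // Assumption: a non-default port is part of the site's identity.
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    return $root;
}
```

A scheme-less value such as brandname.com/aboutus/ yields null here, because parse_url treats the whole string as a path; that is why the scheme has to be fixed up first.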

Fourth:
That's it; update the database row with the new working URL.

All Together:

function get_header_code($url){
    /*** 
     cURL
     ***/
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_HEADER, true);    // we want headers
    curl_setopt($ch, CURLOPT_NOBODY, true);    // we don't need body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
    curl_setopt($ch, CURLOPT_TIMEOUT,10);
    $output = curl_exec($ch);
    $httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $httpCode;
}

function clean_url($link){
    // Note: this lowercases the path too, which can break case-sensitive URLs.
    $link = strtolower($link);
    $link = filter_var($link, FILTER_SANITIZE_URL);

    if(substr($link,0,8) !== "https://" && substr($link,0,7) !== "http://"){
        $link = "http://".$link;
    }

    if(filter_var($link, FILTER_VALIDATE_URL) === FALSE){
        /***
         Invalid URL so clean and remove.
         ***/
        return false;
    }
    $httpCode = get_header_code($link);

    if($httpCode == 200){
      /***
       works, so return full URL
       ***/
      return $link;
    }
    if($httpCode >= 400 ){
     /*** errors 400+ ***/
        $siteUrlParts = parse_url($link);
        $siteUrl = $siteUrlParts['scheme']."://".$siteUrlParts['host'];
        if(get_header_code($siteUrl) == 200){
             /***
              Obviously you can add conditionals to accept if it is a
              redirection but this is a basic example
              ***/  
             return $siteUrl;
        }
        return false;
    }
    else {
       /***
        some other header, up to you how you want to handle this.
        could be a redirect 301, 302 or something... 
        ***/
       return false; 
    }

}

And run it as:

/***
 returns either false or the URL of a working domain from the Db.
 ***/
$updateValueUrl = clean_url($databaseRow['url']);
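For the redirect case that clean_url leaves open, one rough heuristic for spotting domains that have changed hands is to let cURL follow redirects (CURLOPT_FOLLOWLOCATION), read the final address with CURLINFO_EFFECTIVE_URL, and compare its host with the original one. The sketch below covers only the comparison; is_same_site() is a hypothetical name, and treating "www." as insignificant is an assumption. It will not detect a parking page served from the original domain itself.

```php
<?php
// Hypothetical helper (not part of the original answer).
// True when both URLs point at the same host, ignoring case and a
// leading "www."; false when either host is missing.
function is_same_site(string $originalUrl, string $finalUrl): bool
{
    $bareHost = static function (string $url): string {
        // parse_url() yields null or false when there is no usable host;
        // both become an empty string after the cast.
        $host = strtolower((string) parse_url($url, PHP_URL_HOST));
        return preg_replace('/^www\./', '', $host);
    };

    $original = $bareHost($originalUrl);
    return $original !== '' && $original === $bareHost($finalUrl);
}
```

A row whose final host differs from the stored one is probably worth flagging for manual review rather than updating automatically.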

This is probably not quite perfect for your case, but it should give you a good grounding from which to build the behaviour you want. Once this is in place, you can run a PHP/MySQL loop that grabs the URLs in LIMIT batches (of maybe 500 or 1000 at a time), loops through each batch using foreach, and updates each row with the output from these functions.
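The batching loop could be sketched as follows. To keep the example free of any database setup, the fetching and the per-row work are passed in as callables; process_in_batches() and the integer id column it pages on are assumptions, not something stated in the question.

```php
<?php
// Hypothetical helper (not part of the original answer).
// $fetchBatch($lastId, $limit) must return up to $limit rows with
// id > $lastId, ordered by id; in MySQL that would be a query such as
//   SELECT id, url FROM manufacturers WHERE id > ? ORDER BY id LIMIT ?
// (table and column names are made up for illustration).
// $handleRow($row) does the per-row work, e.g. clean_url() plus an UPDATE.
function process_in_batches(callable $fetchBatch, callable $handleRow, int $batchSize = 500): int
{
    $lastId = 0;
    $processed = 0;

    do {
        $rows = $fetchBatch($lastId, $batchSize);
        foreach ($rows as $row) {
            $handleRow($row);
            $lastId = $row['id'];   // remember where the next batch starts
            $processed++;
        }
        // A short (or empty) batch means there is nothing left to fetch.
    } while (count($rows) === $batchSize);

    return $processed;
}
```

Paging on the last seen id instead of using LIMIT with an OFFSET means that updating rows during the run cannot cause rows to be skipped or visited twice.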
