
I have a set of URLs in an array. Some are just domains (http://google.com) and some are subdomains (http://test.google.com).

I am trying to extract just the domain part from each of them without the subdomain.

parse_url($domain)

still keeps the subdomain.

Is there another way?

  • How would this work for two-part TLDs? Consider test.google.co.uk - which part would you want in that case? Commented Nov 3, 2011 at 12:57
  • Or "pseudo"-two-part, like .uk.com Commented Nov 3, 2011 at 12:58
  • Pekka - in that case google.co.uk meaning the main domain. The thing you would buy from the domain registrar if that makes sense. Commented Nov 3, 2011 at 13:00
  • Have a look at this page: wiki.mozilla.org/TLD_List - as you can see, there are more exceptions than rules. You can probably use that list to do some sort of parsing, but I'm not sure how useful that would be. Commented Nov 3, 2011 at 13:03
  • @David19801 I think the point he was making is that there is no automatic way to do it because some TLDs consist of 2 parts (e.g. .co.uk). Commented Nov 3, 2011 at 13:04

3 Answers


If you're only concerned with actual top-level domains, the simple answer is to keep just the last two labels of the host name: the label before the last dot, plus the TLD itself.
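As a rough illustration of that simple case, the sketch below pulls the host out with parse_url() and keeps the last two labels. The function name lastTwoLabels is made up for this example, and the approach is deliberately naive: it gives the wrong answer for two-part suffixes such as .co.uk.

```php
<?php
// Hypothetical helper: keep only the last two dot-separated labels of a URL's host.
function lastTwoLabels($url)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no host component could be extracted
    }

    $labels = explode('.', $host);

    return implode('.', array_slice($labels, -2));
}

echo lastTwoLabels('http://test.google.com'), "\n";   // google.com
echo lastTwoLabels('http://test.google.co.uk'), "\n"; // co.uk (not what you want)
```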

However, if you're looking for "whatever you buy from a registrar", that is much trickier. IANA delegates authority for each country-specific TLD to the national registrars, which means that allocation policy varies for each TLD. Famous examples include .co.uk, .org.uk, etc., but there are countless others that are less well known (for example .priv.no).

If you need a solution that will work correctly for every single TLD in existence, you will have to research the policy for each TLD, which is quite an undertaking since many national registrars have horrible websites with unclear policies that, just to make it even more confusing, often are not available in English.

In practice, however, you probably don't need to account for every TLD or for every available subdomain within every TLD. So a practical solution would be to compile a list of known 2-part (and longer) TLDs that you need to support. Anything that doesn't match that list, you can treat as a 1-part TLD. Like so:

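<?php
// Known multi-part suffixes that you need to support (extend as required).
$special_domains = array('co.uk', 'org.uk' /* ... etc */);

function getDomain($domain)
{
    global $special_domains;

    // Check whether the host ends in one of the known multi-part suffixes.
    foreach ($special_domains as $special)
    {
        $suffix = '.' . $special;

        if (substr($domain, -strlen($suffix)) == $suffix)
        {
            // Strip the suffix, then keep only the label right before it.
            $name = substr($domain, 0, -strlen($suffix));
            $lastdot = strrpos($name, '.');

            return ($lastdot === false ? $name : substr($name, $lastdot + 1)) . $suffix;
        }
    }

    // Otherwise treat everything after the last dot as a 1-part TLD.
    $tldpos = strrpos($domain, '.');
    if ($tldpos === false)
    {
        return $domain;
    }

    $lastdot = strrpos(substr($domain, 0, $tldpos), '.');

    return $lastdot === false ? $domain : substr($domain, $lastdot + 1);
}
?>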

PS: I haven't tested this code so it may need some modification but the basic logic should be ok.


There might be a workaround for the .co.uk problem.

Let's presume that if it is possible to register *.co.uk, *.org.uk, *.mil.ae and similar domains, then it is not possible to resolve the DNS of co.uk, org.uk and mil.ae themselves. I've checked some URLs and it seemed to be true.

Then you can use something like this:

$testdomains = array(
    'http://google.com',
    'http://probablynotexisting.com',
    'http://subdomain.bbc.co.uk', // should resolve into bbc.co.uk, because it is not possible to ping co.uk
    'http://bbc.co.uk'
);

foreach ($testdomains as $raw_domain) {

    $domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -2));

    $ip = gethostbyname($domain);

    if ($ip == $domain) {
        // failure, let's include another dot
        $domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -3));

        $ip = gethostbyname($domain);
        if ($ip == $domain) {
            // another failure, shall we give up and move on!
            echo $raw_domain . ": failed<br />\n";
            continue;
        }
    }

    echo $raw_domain . ' -> ' . $domain . ": ok [" . $ip . "]<br />\n";

}

The output is like this:

http://google.com -> google.com: ok [72.14.204.147]
http://probablynotexisting.com: failed
http://subdomain.bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]
http://bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]

Note: resolving DNS is a slow process.
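Since each lookup is slow and many URLs in an array are likely to share a base domain, one option is to remember results you have already seen. This is a minimal sketch, assuming a hypothetical helper named cachedResolve(); the $resolver parameter is also an assumption, added so the lookup can be swapped out, and it defaults to gethostbyname().

```php
<?php
// Hypothetical helper: memoize lookups so each distinct host is resolved once per script run.
function cachedResolve($host, $resolver = 'gethostbyname')
{
    static $cache = array();

    if (!array_key_exists($host, $cache)) {
        $cache[$host] = call_user_func($resolver, $host);
    }

    return $cache[$host];
}
```

In the loop above, the gethostbyname($domain) calls could be swapped for cachedResolve($domain). The cache is keyed by host name only and lasts just for the current script run.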

4 Comments

One issue though: A domain name may be both valid and registered even if it doesn't resolve in DNS.
Well, testing the existence/validity of a domain is a whole different issue. My idea was just to check that if co.uk does not resolve, but bbc.co.uk does, then the domain part shall be bbc.co.uk, rather than co.uk.
But what if you're looking for foo.x354912ceg.com ? x354912ceg.com doesn't resolve, so your solution would treat "foo" as the domain when it should be "x354912ceg".
This is a clever idea! There are some pitfalls (e.g. .uk.com type, privately held "subdomain" providers) but still - interesting.

Let dig do the hard work for you. Extract the required base domain from the first field in the AUTHORITY section of a dig on any sub-domain (which doesn't need to exist) of the sub-domain/domain in question. Examples (in bash, not PHP, sorry)...

dig @8.8.8.8 notexist.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.

or

dig @8.8.8.8 notexist.test.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.

or

dig @8.8.8.8 notexist.www.xn--zgb6acm.xn--mgberp4a5d4ar|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
xn--zgb6acm.xn--mgberp4a5d4ar.

Where

grep -A1 filters out all lines except the line with the string ;; AUTHORITY SECTION: and 1 line after it.

tail -n1 leaves only the last 1 line of the above 2 lines.

sed "s/[[:space:]]\+/~/g" replaces dig's delimiters (1 or more consecutive spaces or tabs) with a custom delimiter, ~. This could be any character that never occurs on the line.

cut -d'~' -f1 extracts the first field where the fields are delimited by the custom delimiter from above.
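If you want the same extraction from PHP, the text-processing part of the pipeline can be sketched as below. The function name authorityOwner is made up for this example, and it assumes you already have dig's output as a string; it only mirrors the grep/tail/sed/cut steps and does no DNS work itself.

```php
<?php
// Hypothetical helper: return the first field of the line that follows
// ";; AUTHORITY SECTION:" in dig's textual output, or null if it is absent.
function authorityOwner($digOutput)
{
    $lines = preg_split('/\R/', $digOutput);

    foreach ($lines as $i => $line) {
        if (strpos($line, ';; AUTHORITY SECTION:') !== false && isset($lines[$i + 1])) {
            // Fields are separated by runs of spaces or tabs.
            $fields = preg_split('/\s+/', trim($lines[$i + 1]));

            return $fields[0] !== '' ? $fields[0] : null;
        }
    }

    return null;
}
```

Assuming dig is installed and shell_exec() is enabled, a call might look like authorityOwner(shell_exec('dig @8.8.8.8 ' . escapeshellarg('notexist.' . $host))). As in the bash examples, the result keeps its trailing dot.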
