PHP extract just the main domain not subdomain from URL

Question

I have a set of URLs in an array. Some are bare domains (http://google.com) and some include subdomains (http://test.google.com).

I am trying to extract just the domain part from each of them without the subdomain.

parse_url($domain)

still keeps the subdomain.

Is there another way?

How would this work for two-part TLDs? Consider test.google.co.uk - which part would you want in that case? — Pekka
– Pekka, Commented Nov 3, 2011 at 12:57
Pekka - in that case google.co.uk meaning the main domain. The thing you would buy from the domain registrar if that makes sense. — David19801
– David19801, Commented Nov 3, 2011 at 13:00
Have a look at this page: wiki.mozilla.org/TLD_List - as you can see, there are more exceptions than rules. You can probably use that list to do some sort of parsing, but I'm not sure how useful that would be. — Aleks G
– Aleks G, Commented Nov 3, 2011 at 13:03
@David19801 I think the point he was making is that there is no automatic way to do it because some TLD's consist of 2 parts (eg. .co.uk). — Nico
– Nico, Commented Nov 3, 2011 at 13:04

Nico · Accepted Answer · 2012-08-22 08:51:36Z

If you're only concerned with actual top-level domains, the simple answer is to take the label directly before the last dot in the domain name (for example, google in test.google.com).

However, if you're looking for "whatever you buy from a registrar", that is much trickier. IANA delegates authority for each country-specific TLD to the national registrars, which means that allocation policy varies for each TLD. Famous examples include .co.uk and .org.uk, but there are countless lesser-known ones (for example .priv.no).

If you need a solution that will work correctly for every single TLD in existence, you will have to research the policy for each TLD, which is quite an undertaking since many national registrars have horrible websites with unclear policies that, just to make it even more confusing, are often not available in English.

In practice, however, you probably don't need to account for every TLD or for every available subdomain within every TLD. So a practical solution would be to compile a list of known 2-part (and longer) TLDs that you need to support. Anything that doesn't match that list, you can treat as a 1-part TLD. Like so:

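<?php
// Known multi-part suffixes; extend this list with the ones you need to support.
$special_domains = array('co.uk', 'org.uk' /* ... etc */);

function getDomain($domain)
{
    global $special_domains;

    // Strip a known multi-part suffix if one matches...
    $stripped = null;
    foreach ($special_domains as $special) {
        $suffix = '.' . $special;
        if (substr($domain, -strlen($suffix)) == $suffix) {
            $stripped = substr($domain, 0, -strlen($suffix));
            break;
        }
    }

    // ...otherwise treat everything after the last dot as a 1-part TLD.
    if ($stripped === null) {
        $lastdot = strrpos($domain, '.');
        $stripped = ($lastdot !== false) ? substr($domain, 0, $lastdot) : $domain;
    }

    // The label after the last remaining dot is the registered name (e.g. "google").
    $lastdot = strrpos($stripped, '.');

    return ($lastdot !== false) ? substr($stripped, $lastdot + 1) : $stripped;
}
?>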

PS: I haven't tested this code so it may need some modification but the basic logic should be ok.
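To tie the list-based approach back to the question (URLs in, "google.co.uk" out), here is a minimal sketch that combines parse_url() with a suffix list and returns the registered label together with its suffix. The function name registrableDomainFromUrl and the default two-entry suffix list are assumptions made for illustration; a real list would need to be much longer.

```php
<?php
// Hypothetical helper: returns the registrable domain (label + suffix) for a URL.
// $suffixes is an assumed, hand-maintained list of multi-part suffixes.
function registrableDomainFromUrl($url, array $suffixes = array('co.uk', 'org.uk'))
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null; // no host component, e.g. a URL without a scheme
    }
    $host = strtolower(rtrim($host, '.'));

    // Default: the suffix is the last label only (a 1-part TLD).
    $suffixLabels = 1;
    foreach ($suffixes as $suffix) {
        $needle = '.' . $suffix;
        if (substr($host, -strlen($needle)) === $needle) {
            // Prefer the longest matching suffix.
            $suffixLabels = max($suffixLabels, substr_count($suffix, '.') + 1);
        }
    }

    // Keep the suffix plus the one label in front of it.
    $labels = explode('.', $host);
    return implode('.', array_slice($labels, -($suffixLabels + 1)));
}

$urls = array('http://google.com', 'http://test.google.com', 'http://test.google.co.uk');
foreach ($urls as $url) {
    echo $url . ' -> ' . registrableDomainFromUrl($url) . "\n";
}
```

Any suffix missing from the list is silently treated as a 1-part TLD, so the result is only as good as the list.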

Vilius Paulauskas · Accepted Answer · 2011-11-03 13:42:34Z

There might be a work-around for .co.uk problem.

Let's presume that if it is possible to register *.co.uk, *.org.uk, *.mil.ae and similar domains, then co.uk, org.uk and mil.ae themselves do not resolve in DNS. I've checked some URLs and this seemed to hold.

Then you can use something like this:

$testdomains = array(
    'http://google.com',
    'http://probablynotexisting.com',
    'http://subdomain.bbc.co.uk', // should resolve into bbc.co.uk, because it is not possible to ping co.uk
    'http://bbc.co.uk'
);

foreach ($testdomains as $raw_domain) {

    $domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -2));

    $ip = gethostbyname($domain);

    if ($ip == $domain) {
        // failure, let's include another dot
        $domain = join('.', array_slice(explode('.', parse_url($raw_domain, PHP_URL_HOST)), -3));

        $ip = gethostbyname($domain);
        if ($ip == $domain) {
            // another failure, shall we give up and move on!
            echo $raw_domain . ": failed<br />\n";
            continue;
        }
    }

    echo $raw_domain . ' -> ' . $domain . ": ok [" . $ip . "]<br />\n";

}

The output is like this:

http://google.com -> google.com: ok [72.14.204.147]
http://probablynotexisting.com: failed
http://subdomain.bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]
http://bbc.co.uk -> bbc.co.uk: ok [212.58.241.131]

Note: resolving DNS is a slow process.
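Since each lookup is slow, one option is to remember the answers so that a suffix such as co.uk is only resolved once per run. The sketch below is an assumption-laden variation of the loop above: resolveBaseDomain, the $cache array and the injectable $resolver parameter are hypothetical names, and the resolver defaults to gethostbyname().

```php
<?php
// Sketch: try the last 2, then the last 3 labels, caching every DNS answer.
// $resolver is a hypothetical injection point; it defaults to gethostbyname().
function resolveBaseDomain($host, array &$cache, $resolver = 'gethostbyname')
{
    $labels = explode('.', $host);
    foreach (array(2, 3) as $count) {
        $candidate = implode('.', array_slice($labels, -$count));
        if (!array_key_exists($candidate, $cache)) {
            $cache[$candidate] = call_user_func($resolver, $candidate);
        }
        // gethostbyname() returns its argument unchanged on failure.
        if ($cache[$candidate] !== $candidate) {
            return $candidate;
        }
    }
    return null; // neither candidate resolved
}

// Demo with a fake resolver so no network access is needed.
// The address is made up (taken from the documentation range 192.0.2.0/24).
$fake = function ($name) {
    $known = array('bbc.co.uk' => '192.0.2.1');
    return isset($known[$name]) ? $known[$name] : $name;
};

$cache = array();
echo resolveBaseDomain('subdomain.bbc.co.uk', $cache, $fake), "\n";
echo resolveBaseDomain('www.bbc.co.uk', $cache, $fake), "\n"; // both candidates come from the cache
```

Passing the cache by reference keeps it under the caller's control, so it could also be persisted between runs.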

4 Comments

Nico Over a year ago

One issue though: A domain name may be both valid and registered even if it doesn't resolve in DNS.

Vilius Paulauskas Over a year ago

Well, testing the existence/validity of a domain is a whole different issue. My idea was just that if co.uk does not resolve but bbc.co.uk does, then the domain part should be bbc.co.uk rather than co.uk.

Nico Over a year ago

But what if you're looking for foo.x354912ceg.com ? x354912ceg.com doesn't resolve, so your solution would treat "foo" as the domain when it should be "x354912ceg".

Pekka Over a year ago

This is a clever idea! There are some pitfalls (e.g. .uk.com type, privately held "subdomain" providers) but still - interesting.

Stew-au · Accepted Answer · 2014-12-08 02:29:46Z

Let dig do the hard work for you. Extract the required base domain from the first field in the AUTHORITY section of a dig on any sub-domain (which doesn't need to exist) of the sub-domain/domain in question. Examples (in bash not php sorry)...

dig @8.8.8.8 notexist.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.

or

dig @8.8.8.8 notexist.test.google.com|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
google.com.

or

dig @8.8.8.8 notexist.www.xn--zgb6acm.xn--mgberp4a5d4ar|grep -A1 ';; AUTHORITY SECTION:'|tail -n1|sed "s/[[:space:]]\+/~/g"|cut -d'~' -f1
xn--zgb6acm.xn--mgberp4a5d4ar.

Where

grep -A1 filters out all lines except the line with the string ;; AUTHORITY SECTION: and 1 line after it.

tail -n1 leaves only the last 1 line of the above 2 lines.

sed "s/[[:space:]]\+/~/g" replaces dig's delimiters (1 or more consecutive spaces or tabs) with a custom delimiter ~. This could be any character that never occurs on the line.

cut -d'~' -f1 extracts the first field where the fields are delimited by the custom delimiter from above.
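For PHP users, the same grep/tail/sed/cut steps can be approximated with a small parser over dig's text output. This is only a sketch: baseDomainFromDigOutput is a hypothetical name, and the sample text is hand-written to resemble dig output rather than captured from a real query.

```php
<?php
// Sketch: return the first field of the line that follows ";; AUTHORITY SECTION:".
function baseDomainFromDigOutput($digOutput)
{
    $lines = preg_split('/\R/', $digOutput);
    foreach ($lines as $i => $line) {
        if (strpos($line, ';; AUTHORITY SECTION:') === 0 && isset($lines[$i + 1])) {
            // Fields are separated by runs of spaces or tabs; take the first one.
            $fields = preg_split('/\s+/', trim($lines[$i + 1]));
            return $fields[0] !== '' ? $fields[0] : null;
        }
    }
    return null; // no AUTHORITY section in the output
}

// Illustrative, hand-written sample shaped like dig output (values are made up).
$sample = ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN\n"
    . "\n"
    . ";; AUTHORITY SECTION:\n"
    . "example.com.\t\t60\tIN\tSOA\tns.example.com. admin.example.com. 1 2 3 4 5\n";

echo baseDomainFromDigOutput($sample), "\n";

// Against a live resolver (assumes dig is installed and shell_exec() is enabled):
// $out = shell_exec('dig @8.8.8.8 ' . escapeshellarg('notexist.' . $host));
// echo baseDomainFromDigOutput((string) $out), "\n";
```

As with the shell pipeline, the returned name keeps its trailing dot; rtrim($name, '.') would remove it.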
