
How would you extract the domain name from a URL, excluding any subdomains?

My initial simplistic attempt was:

'.'.join(urlparse.urlparse(url).netloc.split('.')[-2:])

This works for http://www.foo.com, but not for http://www.foo.com.au. Is there a way to do this properly without using special knowledge about valid TLDs (Top-Level Domains) or country codes (because they change)?
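A quick demonstration of the failure mode (a minimal sketch using the same Python 2 urlparse call as above; naive_domain is just an illustrative name):

from urlparse import urlparse

def naive_domain(url):
    # keep only the last two dot-separated labels of the host
    return '.'.join(urlparse(url).netloc.split('.')[-2:])

print(naive_domain('http://www.foo.com'))     # foo.com -- correct
print(naive_domain('http://www.foo.com.au'))  # com.au  -- wrong: the registered name "foo" is lost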

Thanks.


8 Answers

70

Here's a great Python module someone wrote to solve this problem after seeing this question: https://github.com/john-kurkowski/tldextract

The module looks up TLDs in the Public Suffix List, maintained by Mozilla volunteers.

Quote:

tldextract on the other hand knows what all gTLDs [Generic Top-Level Domains] and ccTLDs [Country Code Top-Level Domains] look like by looking up the currently living ones according to the Public Suffix List. So, given a URL, it knows its subdomain from its domain, and its domain from its country code.
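A minimal usage sketch (attribute names follow tldextract's documented result object; registered_domain is a convenience property in recent versions, and the outputs assume the .com.au example from the question):

import tldextract

ext = tldextract.extract('http://www.foo.com.au')
print(ext.subdomain)          # www
print(ext.domain)             # foo
print(ext.suffix)             # com.au
print(ext.registered_domain)  # foo.com.au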


2 Comments

This worked for me where tld failed (it marked a valid URL as invalid).
Lost too much time thinking about the problem, should have known and used this from the start.
57

No, there is no "intrinsic" way of knowing that (e.g.) zap.co.it is a subdomain (because Italy's registrar DOES sell domains such as co.it) while zap.co.uk isn't (because the UK's registrar DOESN'T sell domains such as co.uk, only domains like zap.co.uk).

You'll just have to use an auxiliary table (or online source) to tell you which TLDs behave peculiarly like the UK's and Australia's -- there's no way of divining that from just staring at the string without such extra semantic knowledge (of course it can change eventually, but if you can find a good online source, that source will also change accordingly, one hopes!-).
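To make that concrete, here's a toy sketch of the lookup (the four-entry suffix table is illustrative only; real code should use the full Public Suffix List, as the answers below do):

# toy auxiliary table -- 'co.it' is absent because co.it is a registered domain,
# while 'co.uk' is present because the UK registrar treats it as a suffix
suffixes = {'com', 'it', 'uk', 'co.uk'}

def registered_domain(host):
    labels = host.split('.')
    # scan tails from longest to shortest: the first tail found in the table
    # is the public suffix; the registered domain is that plus one more label
    for i in range(len(labels)):
        if '.'.join(labels[i:]) in suffixes:
            return '.'.join(labels[max(i - 1, 0):])
    return host

print(registered_domain('zap.co.uk'))  # zap.co.uk -- 'co.uk' is a suffix
print(registered_domain('zap.co.it'))  # co.it     -- only 'it' is a suffix; co.it was bought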


44

Using this file of effective TLDs, which someone else found on Mozilla's website:

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tld_file:
    tlds = [line.strip() for line in tld_file if line[0] not in "/\n"]

def get_domain(url, tlds):
    url_elements = urlparse(url)[1].split('.')
    # url_elements = ["abcde","co","uk"]

    for i in range(-len(url_elements), 0):
        last_i_elements = url_elements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(last_i_elements) # abcde.co.uk, co.uk, uk
        wildcard_candidate = ".".join(["*"] + last_i_elements[1:]) # *.co.uk, *.uk, *
        exception_candidate = "!" + candidate

        # match tlds: 
        if (exception_candidate in tlds):
            return ".".join(url_elements[i:]) 
        if (candidate in tlds or wildcard_candidate in tlds):
            return ".".join(url_elements[i-1:])
            # returns "abcde.co.uk"

    raise ValueError("Domain not in global list of TLDs")

print get_domain("http://abcde.co.uk", tlds)

results in:

abcde.co.uk

I'd appreciate it if someone let me know which bits of the above could be rewritten in a more pythonic way. For example, there must be a better way of iterating over the last_i_elements list, but I couldn't think of one. I also don't know if ValueError is the best thing to raise. Comments?

4 Comments

If you need to call get_domain() often in practice, such as extracting domains from a large log file, I would recommend that you make tlds a set, e.g. tlds = set(line.strip() for line in tld_file if line[0] not in "/\n"). This gives you constant-time lookup for each check of whether some item is in tlds. I saw a speedup of about 1500 times for the lookups (set vs. list) and, for my entire operation extracting domains from a ~20 million line log file, about a 60 times speedup (6 minutes, down from 6 hours).
This is awesome! Just one more question: is that effective_tld_names.dat file also updated for new domains such as .amsterdam, .vodka and .wtf?
The Mozilla public suffix list gets regular maintenance, yes, and now has multiple Python libraries which include it. See publicsuffix.org and the other answers on this page.
Some updates to get this right in 2021: the file is now called public_suffix_list.dat, and Python will complain if you don't specify that it should read the file as UTF8. Specify the encoding explicitly: with open("public_suffix_list.dat", encoding="utf8") as tld_file
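Putting those two comments together, the loading step from the answer above becomes something like this (assuming Python 3 and the renamed file):

# load the public suffix list into a set for constant-time membership tests
with open("public_suffix_list.dat", encoding="utf8") as tld_file:
    tlds = set(line.strip() for line in tld_file if line[0] not in "/\n")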
42

Using the Python tld package

https://pypi.python.org/pypi/tld

Install

pip install tld

Get the TLD name as a string from the given URL:

from tld import get_tld
print get_tld("http://www.google.co.uk") 

co.uk

Or without a protocol:

from tld import get_tld

get_tld("www.google.co.uk", fix_protocol=True)

co.uk

Get the TLD as an object

from tld import get_tld

res = get_tld("http://some.subdomain.google.co.uk", as_object=True)

res
# 'co.uk'

res.subdomain
# 'some.subdomain'

res.domain
# 'google'

res.tld
# 'co.uk'

res.fld
# 'google.co.uk'

res.parsed_url
# SplitResult(
#     scheme='http',
#     netloc='some.subdomain.google.co.uk',
#     path='',
#     query='',
#     fragment=''
# )

Get the first-level domain name as a string from the given URL:

from tld import get_fld

get_fld("http://www.google.co.uk")
# 'google.co.uk'

8 Comments

This will become more unreliable with the new gTLDs.
Hey, thanks for pointing at this. I guess, when it comes to the point that new gTLDs are actually being used, a proper fix could come into the tld package.
Thank you @ArturBarseghyan! It's very easy to use with Python. But I am using it now for an enterprise-grade product; is it a good idea to continue using it even if gTLDs are not supported? If yes, when do you think gTLDs will be supported? Thank you again.
@Akshay Patil: As stated above, when it comes to the point that gTLDs are intensively used, a proper fix (if possible) would arrive in the package. In the meantime, if you're much concerned about gTLDs, you can always catch the tld.exceptions.TldDomainNotFound exception and proceed with whatever you were doing, even if the domain hasn't been found.
Is it just me, or does tld.get_tld() actually return a fully qualified domain name, not a top level domain?
2

There are many, many TLDs. Here's the list:

http://data.iana.org/TLD/tlds-alpha-by-domain.txt

Here's another list:

http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

Here's another list:

http://www.iana.org/domains/root/db/

2 Comments

That doesn't help, because it doesn't tell you which ones have an "extra level", like co.uk.
Lennart: It helps; you can wrap them as optional alternatives within a regex.
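For what that comment is suggesting, a rough sketch might look like this (the five-entry list is illustrative only; a real version would load the full IANA file, plus a multi-label suffix source, since the IANA file alone lacks entries like co.uk):

import re

# toy list -- in practice, read data.iana.org/TLD/tlds-alpha-by-domain.txt
tlds = ['com', 'co.uk', 'com.au', 'uk', 'au']

# longest alternatives first, so 'co.uk' wins over 'uk'
pattern = '|'.join(sorted((re.escape(t) for t in tlds), key=len, reverse=True))
domain_re = re.compile(r'([^.]+)\.(%s)$' % pattern)

print(domain_re.search('www.foo.com.au').group(0))  # foo.com.au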
0

Until get_tld is updated for all the new ones, I pull the TLD out of the error message. Sure, it's bad code, but it works.

import re
from tld import get_tld

# renamed so it doesn't shadow (and recursively call) tld.get_tld
def get_tld_from_error(url):
  try:
    return get_tld(url)
  except Exception as e:
    # the tld package embeds the offending domain in the error message
    match = re.search(r"Domain ([^ ]+) didn't match any existing TLD name!", str(e))
    if match:
      return match.group(1)
    raise


-1

Here's how I handle it:

import re
import sys
import urlparse

# 'url' is the input string; prepend a scheme if it is missing
if not url.startswith('http'):
    url = 'http://' + url
website = urlparse.urlparse(url)[1]
domain = '.'.join(website.split('.')[-2:])
match = re.search(r'((www\.)?([A-Z0-9.-]+\.[A-Z]{2,4}))', domain, re.I)
if not match:
    sys.exit(2)

1 Comment

There is a TLD called .travel; it won't work with the above code, because the regex only allows TLDs of 2 to 4 characters.
-1

In Python I used to use tldextract until it failed with a URL like www.mybrand.sa.com, parsing it as subdomain='www.mybrand', domain='sa', suffix='com'!

So finally, I decided to write this method:

IMPORTANT NOTE: this only works with URLs that have a subdomain in them. It isn't meant to replace more advanced libraries like tldextract.

def urlextract(url):
  # expects a bare hostname such as "www.mybrand.sa.com" (no scheme)
  url_split = url.split(".")
  if len(url_split) <= 2:
      raise Exception("Full url required with subdomain:", url)
  return {'subdomain': url_split[0], 'domain': url_split[1], 'suffix': ".".join(url_split[2:])}
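For example (hostname only, matching the note above):

print(urlextract('www.mybrand.sa.com'))
# {'subdomain': 'www', 'domain': 'mybrand', 'suffix': 'sa.com'}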

