Get subdomain from URL using Python

Question

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable so I can do something like this:

print SubAddr
>> lol1

This question should be useful: stackoverflow.com/questions/1066933/…
– Acorn, Commented Aug 3, 2011 at 11:47

wjandrea · Accepted Answer · 2022-09-17 16:15:09Z

32

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:

>>> import tldextract
>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract correctly handles multi-level subdomains such as sub.lol1.

edited Sep 17, 2022 at 16:15

wjandrea


answered May 1, 2015 at 13:05

Lluís Vilanova


1 Comment

Tom St Over a year ago

great answer, should be voted as the best one :) thanks Lluis

radtek · Accepted Answer · 2022-03-29 17:12:58Z

18

urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.

import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]

edited Mar 29, 2022 at 17:12

radtek


answered Aug 3, 2011 at 11:47

Daniel Roseman


9 Comments

Marko Over a year ago

Works very good. I used it like so Node = urlparse.urlparse(address).hostname.split('.')[0]

naktinis Over a year ago

What if it's an IP address? And what if it has a second level subdomain?

sidneydobber Over a year ago

Subdomains may contain multiple dots so api.test is also valid, just keep this in mind. If you want a good package for doing this check https://pypi.python.org/pypi/tldextract.

mlissner Over a year ago

This is actually a pretty bad answer. It fails if there's no subdomain, returning the domain instead. It fails for IP addresses (ok, fine), and it fails for multiple subdomains, like web.host1.google.com.

Lord Elrond Over a year ago

in python 3.x you need to import this via from urllib.parse import urlparse
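
The caveats raised in these comments (no subdomain, IP addresses, several subdomain levels) can be handled with a few extra checks. The following is a minimal Python 3 sketch and not part of the original answer. The function name `naive_subdomain` is hypothetical, and the code still assumes the registered domain is exactly the last two labels, so it remains wrong for suffixes such as co.uk.

```python
from ipaddress import ip_address
from urllib.parse import urlparse


def naive_subdomain(url):
    """Return everything left of the last two host labels, or '' if none.

    Assumption: the registered domain is exactly two labels (e.g. domain.com).
    """
    host = urlparse(url).hostname
    if host is None:
        return ''  # no network location (e.g. the scheme is missing)
    try:
        ip_address(host)
        return ''  # an IP address has no subdomain
    except ValueError:
        pass  # not an IP address, so treat it as a host name
    labels = host.split('.')
    return '.'.join(labels[:-2])


print(naive_subdomain('http://lol1.domain.com:8888/some/page'))  # lol1
print(naive_subdomain('http://web.host1.google.com/'))           # web.host1
print(repr(naive_subdomain('http://domain.com/')))               # ''
print(repr(naive_subdomain('http://192.168.0.1:8080/')))         # ''
```

For hosts under multi-label suffixes, a public-suffix-aware approach such as tldextract is still the safer choice.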


Community · Accepted Answer · 2017-05-23 12:25:31Z

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL

You will need the list of effective tlds from here

from __future__ import with_statement
from urlparse import urlparse

# load tlds, ignore comments and empty lines:
with open("effective_tld_names.dat.txt") as tldFile:
    tlds = [line.strip() for line in tldFile if line[0] not in "/\n"]

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements),0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"]+lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!"+candidate

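        # match tlds:
        if (exceptionCandidate in tlds):
            # exception rule (e.g. "!www.ck"): the suffix is the candidate minus its first label
            return DomainParts(urlElements[:i+1], '.'.join(urlElements[i+1:]))
        if (candidate in tlds or wildcardCandidate in tlds):
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))
            # e.g. DomainParts(["abcde"], "co.uk")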

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80",tlds)
print "Domain:", domain_parts.domain
print "Subdomains:", domain_parts.subdomains or "None"
print "TLD:", domain_parts.tld

Gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk

Updated link to "list of effective tlds": wiki.mozilla.org/Public_Suffix_List#TLD_Lists, publicsuffix.org
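
For readers on Python 3, the same matching idea can be sketched without the data file. This is an illustration only: `SAMPLE_RULES` is a tiny hand-picked stand-in for the real list of effective TLDs, and `split_host` is a hypothetical helper name. Exception rules (those starting with `!`) are handled by dropping the rule's leftmost label to obtain the suffix.

```python
from urllib.parse import urlparse

# Tiny hand-picked sample of rules; the real list has thousands of entries.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(url, rules=SAMPLE_RULES):
    """Return (subdomains, domain, suffix) using the longest matching rule."""
    labels = urlparse(url).hostname.split(".")
    # Try the longest candidate suffix first, then shorter ones.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the suffix is the candidate minus its first label.
            suffix_start = i + 1
        elif candidate in rules or wildcard in rules:
            suffix_start = i
        else:
            continue
        rest = labels[:suffix_start]
        domain = rest[-1] if rest else None
        return rest[:-1], domain, ".".join(labels[suffix_start:])
    raise ValueError("Host does not match any known suffix")


print(split_host("http://sub2.sub1.example.co.uk:80"))  # (['sub2', 'sub1'], 'example', 'co.uk')
print(split_host("http://www.ck/"))                     # ([], 'www', 'ck')
```

Returning a plain tuple keeps the sketch short; a small class like DomainParts would work equally well.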

Steve Mayne · Accepted Answer · 2011-08-03 11:44:39Z

4

A very basic approach, without any sanity checking, could look like this:

address = 'http://lol1.domain.com:8888/some/page'

host = address.partition('://')[2]
sub_addr = host.partition('.')[0]

print sub_addr

This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

http://www.google.com/

Is that what you mean?
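
To make the intermediate values concrete, here is a short Python 3 sketch of what each partition call returns for the address in the question. The variable names follow the answer, and nothing beyond the standard str.partition method is assumed.

```python
address = 'http://lol1.domain.com:8888/some/page'

# partition() returns a 3-tuple: (text before, separator, text after).
scheme, sep, host = address.partition('://')
print(scheme)  # http
print(host)    # lol1.domain.com:8888/some/page

# Everything before the first dot is taken as the subdomain.
sub_addr = host.partition('.')[0]
print(sub_addr)  # lol1

# Without a scheme the separator is not found, so the "after" part is empty.
print('lol1.domain.com'.partition('://'))  # ('lol1.domain.com', '', '')
```

The last line shows why this approach needs the scheme to be present: for a bare host name, index [2] is an empty string.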

answered Aug 3, 2011 at 11:44

Steve Mayne


Benjamin K. · Accepted Answer · 2011-08-03 11:48:05Z

2

What you are looking for is in: http://docs.python.org/library/urlparse.html

for example: ".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])

Will do the job for you (will return "www.my")

answered Aug 3, 2011 at 11:48

Benjamin K.


2 Comments

Thomas K Over a year ago

This assumes that the main domain name has two parts - which will fall down in certain cases, e.g. .co.uk addresses. Besides the UK, Israel, Brasil and Japan all have formal second level domains, and there are probably others.

Acorn Over a year ago

My answer deals with this problem using a list of valid TLDs.
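
As a small illustration of the limitation described in these comments, the sketch below (Python 3, standard library only) applies the same drop-the-last-two-labels rule to a .co.uk host. It uses hostname rather than netloc so that the port is not carried along, which is a change from the answer's code, and `drop_last_two` is a hypothetical helper name.

```python
from urllib.parse import urlparse

parsed = urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html')
print(parsed.netloc)    # www.my.cwi.nl:80
print(parsed.hostname)  # www.my.cwi.nl
print(parsed.port)      # 80


def drop_last_two(url):
    # Assumes the registered domain is exactly the last two labels.
    return ".".join(urlparse(url).hostname.split(".")[:-2])


print(drop_last_two('http://www.my.cwi.nl:80/%7Eguido/Python.html'))  # www.my
print(drop_last_two('http://forums.bbc.co.uk'))  # forums.bbc (wrong: "bbc" is the domain)
```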

user14335364 · Accepted Answer · 2020-12-26 06:23:49Z

First, import tldextract, which splits a URL into its constituents: subdomain, domain, and suffix.

import tldextract

Then declare a variable (say ext) to store the result of the query, passing the URL as a quoted string inside the parentheses, as shown below:

ext = tldextract.extract("http://lol1.domain.com:8888/some/page")

If we simply evaluate ext, the output will be:

ExtractResult(subdomain='lol1', domain='domain', suffix='com')

Then, to use only the subdomain, domain, or suffix, use the corresponding attribute below.

ext.subdomain

The result will be:

'lol1'

ext.domain

The result will be:

'domain'

ext.suffix

The result will be:

'com'

Also, if you want to store only the subdomain in a variable, use the code below:

Sub_Domain = ext.subdomain

Then print Sub_Domain:

Sub_Domain

The result will be:

'lol1'

Pausi · Accepted Answer · 2022-02-18 16:35:18Z

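import re

def extract_domain(domain):
   # strip the scheme, then everything from the first ':' or '/' (port and path)
   domain = re.sub(r'https?://|[:/].*', '', domain)
   # heuristic: one label followed by a 2-6 character suffix of letters and dots
   matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
   if matches:
       return matches[0]
   else:
       return domain

def extract_subdomains(domain):
   subdomains = re.sub(r'https?://|[:/].*', '', domain)
   domain = extract_domain(subdomains)
   # remove the domain (escaped, anchored at the end) and the dot before it
   subdomains = re.sub(r'\.?' + re.escape(domain) + '$', '', subdomains)
   return subdomains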

Example to fetch subdomains:

print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))

Outputs:

lol1
kota-tangerang

Example to fetch domain

print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))

Outputs:

domain.com
kpu.go.id

Andres R · Accepted Answer · 2022-05-20 15:10:38Z

1

Standardize all domains to start with www. unless they have a subdomain.

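from urllib.parse import urlparse

def has_subdomain(hostname):
    # naive check: more than two dot-separated labels
    return len(hostname.split('.')) > 2

url = 'http://domain.com/some/page'  # example input (assumed; not in the original)
parsed = urlparse(url)
domain = parsed.netloc

if not has_subdomain(parsed.hostname):
    domain = 'www.' + domain
    url = parsed.scheme + '://' + domain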

answered May 20, 2022 at 15:10

Andres R


MattH · Accepted Answer · 2011-08-03 11:46:05Z

0

For extracting the hostname, I'd use urlparse from urllib2:

>>> from urllib2 import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse.urlparse(a).hostname
'lol1.domain.com'

As for extracting the subdomain, you need to cover the case where the FQDN is longer. How you do this depends on your purposes. I might suggest stripping off the two rightmost components.

E.g.

>>> urlparse.urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'

answered Aug 3, 2011 at 11:46

MattH


Prachit Patil · Accepted Answer · 2018-10-02 17:52:51Z

0

We can use https://github.com/john-kurkowski/tldextract for this problem...

It's easy.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')

answered Oct 2, 2018 at 17:52

Prachit Patil


ozturkib · Accepted Answer · 2020-11-02 14:25:41Z

0

tldextract separates the TLD from the registered domain and subdomains of a URL.

Installation

pip install tldextract

For the current question:

import tldextract

address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)

The output:

Extracted domain name :  domain

In addition, there are a number of examples closely related to the usage of tldextract.extract.

answered Nov 2, 2020 at 14:25

ozturkib


s3bw · Accepted Answer · 2021-11-23 13:51:08Z

0

Using Python 3 (I'm using 3.9 to be specific), you can do the following:

from urllib.parse import urlparse

address = 'http://lol1.domain.com:8888/some/page'

url = urlparse(address)

url.hostname.split('.')[0]

answered Nov 23, 2021 at 13:51

s3bw
