15

For example, the address is:

Address = http://lol1.domain.com:8888/some/page

I want to save the subdomain into a variable so I can do something like this:

print SubAddr
>> lol1
1

12 Answers

32

Package tldextract makes this task very easy, and then you can use urlparse as suggested if you need any further information:

>>> import tldextract
>>> from urllib.parse import urlparse
>>> tldextract.extract("http://lol1.domain.com:8888/some/page")
ExtractResult(subdomain='lol1', domain='domain', suffix='com')
>>> tldextract.extract("http://sub.lol1.domain.com:8888/some/page")
ExtractResult(subdomain='sub.lol1', domain='domain', suffix='com')
>>> urlparse("http://sub.lol1.domain.com:8888/some/page")
ParseResult(scheme='http', netloc='sub.lol1.domain.com:8888', path='/some/page', params='', query='', fragment='')

Note that tldextract properly handles sub-domains.

1 Comment

great answer, should be voted as the best one :) thanks Lluis
18

urlparse.urlparse will split the URL into protocol, location, port, etc. You can then split the location by . to get the subdomain.

import urlparse
url = urlparse.urlparse(address)
subdomain = url.hostname.split('.')[0]
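For Python 3, the same first-label idea might look like the sketch below. The function name leading_subdomain and the rule "a subdomain exists only when the host has at least three labels" are assumptions made for illustration, not part of the answer; the guard for IP addresses uses the standard ipaddress module.

```python
import ipaddress
from urllib.parse import urlparse


def leading_subdomain(address):
    """Return the leftmost host label, or None when there is no subdomain."""
    hostname = urlparse(address).hostname
    if hostname is None:
        return None
    try:
        ipaddress.ip_address(hostname)
        return None  # the host is an IP address, so there is nothing to split
    except ValueError:
        pass
    labels = hostname.split('.')
    # Assumption: the registered domain is exactly the last two labels.
    if len(labels) < 3:
        return None
    return labels[0]


print(leading_subdomain('http://lol1.domain.com:8888/some/page'))
```

This still returns only the first label for hosts with several subdomain levels, and the two-label assumption does not hold for suffixes such as co.uk.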

9 Comments

Works very well. I used it like this: Node = urlparse.urlparse(address).hostname.split('.')[0]
What if it's an IP address? And what if it has a second level subdomain?
Subdomains may contain multiple dots so api.test is also valid, just keep this in mind. If you want a good package for doing this check https://pypi.python.org/pypi/tldextract.
This is actually a pretty bad answer. It fails if there's no subdomain, returning the domain instead. It fails for IP addresses (ok, fine), and it fails for multiple subdomains, like web.host1.google.com.
In Python 3.x you need to import this via from urllib.parse import urlparse
5

Modified version of the fantastic answer here: How to extract top-level domain name (TLD) from URL

You will need the list of effective tlds from here

from urllib.parse import urlparse

# Load the suffix rules, ignoring comments ("//") and empty lines.
# A set gives constant-time membership tests.
with open("effective_tld_names.dat.txt", encoding="utf-8") as tldFile:
    tlds = {line.strip() for line in tldFile if line[0] not in "/\n"}

class DomainParts(object):
    def __init__(self, domain_parts, tld):
        self.domain = None
        self.subdomains = None
        self.tld = tld
        if domain_parts:
            self.domain = domain_parts[-1]
            if len(domain_parts) > 1:
                self.subdomains = domain_parts[:-1]

def get_domain_parts(url, tlds):
    urlElements = urlparse(url).hostname.split('.')
    # urlElements = ["abcde","co","uk"]
    for i in range(-len(urlElements), 0):
        lastIElements = urlElements[i:]
        #    i=-3: ["abcde","co","uk"]
        #    i=-2: ["co","uk"]
        #    i=-1: ["uk"] etc

        candidate = ".".join(lastIElements) # abcde.co.uk, co.uk, uk
        wildcardCandidate = ".".join(["*"] + lastIElements[1:]) # *.co.uk, *.uk, *
        exceptionCandidate = "!" + candidate

        # match tlds:
        if exceptionCandidate in tlds:
            # An exception rule cancels a wildcard, so the suffix is the
            # candidate without its leftmost label.
            return DomainParts(urlElements[:i + 1], '.'.join(urlElements[i + 1:]))
        if candidate in tlds or wildcardCandidate in tlds:
            # e.g. DomainParts(["abcde"], "co.uk")
            return DomainParts(urlElements[:i], '.'.join(urlElements[i:]))

    raise ValueError("Domain not in global list of TLDs")

domain_parts = get_domain_parts("http://sub2.sub1.example.co.uk:80", tlds)
print("Domain:", domain_parts.domain)
print("Subdomains:", domain_parts.subdomains or "None")
print("TLD:", domain_parts.tld)

Gives you:

Domain: example
Subdomains: ['sub2', 'sub1']
TLD: co.uk
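To try the matching logic without downloading the suffix file, here is a minimal self-contained sketch. SAMPLE_RULES is a tiny hand-picked subset of rules chosen for illustration, and split_host is a hypothetical helper name; neither comes from the answer above. It walks the host labels from the longest candidate to the shortest and handles plain, wildcard, and exception rules.

```python
from urllib.parse import urlparse

# Assumption: a tiny hand-picked subset of suffix rules, for illustration only.
SAMPLE_RULES = {"com", "uk", "co.uk", "*.ck", "!www.ck"}


def split_host(url, rules=SAMPLE_RULES):
    """Return (subdomains, domain, suffix) using the longest matching rule."""
    labels = urlparse(url).hostname.split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        wildcard = ".".join(["*"] + labels[i + 1:])
        if "!" + candidate in rules:
            # Exception rule: the suffix is the candidate minus its leftmost label.
            suffix_start = i + 1
        elif candidate in rules or wildcard in rules:
            suffix_start = i
        else:
            continue
        rest = labels[:suffix_start]
        if not rest:
            raise ValueError("host is itself a public suffix")
        return rest[:-1], rest[-1], ".".join(labels[suffix_start:])
    raise ValueError("no matching suffix rule")


print(split_host("http://sub2.sub1.example.co.uk:80"))
```

With the real list, the same function could be given the set loaded from the file instead of SAMPLE_RULES.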

1 Comment

Updated link to "list of effective tlds": wiki.mozilla.org/Public_Suffix_List#TLD_Lists, publicsuffix.org
4

A very basic approach, without any sanity checking, could look like this:

address = 'http://lol1.domain.com:8888/some/page'

host = address.partition('://')[2]
sub_addr = host.partition('.')[0]

print(sub_addr)

This of course assumes that when you say 'subdomain' you mean the first part of a host name, so in the following case, 'www' would be the subdomain:

http://www.google.com/

Is that what you mean?
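One detail of the partition approach is worth seeing in code: str.partition returns an empty tail when the separator is missing, so an address without a scheme would yield an empty string. The sketch below wraps the same idea in a hypothetical helper, first_host_label (a name introduced here for illustration), that falls back to the whole string in that case.

```python
def first_host_label(address):
    """Return the text before the first dot of the host part."""
    # partition returns (head, sep, tail); sep is '' when '://' is absent.
    head, sep, tail = address.partition('://')
    host = tail if sep else head
    return host.partition('.')[0]


print(first_host_label('http://lol1.domain.com:8888/some/page'))
print(first_host_label('lol1.domain.com/some/page'))
```

Like the original, this does no validation and simply treats the first label as the subdomain.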

2

What you are looking for is in: http://docs.python.org/library/urlparse.html

For example, ".".join(urlparse('http://www.my.cwi.nl:80/%7Eguido/Python.html').netloc.split(".")[:-2])

will do the job for you (it returns "www.my").
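As a rough sketch of the same slicing in Python 3, the helper below uses hostname instead of netloc so the port never ends up in a label. The function name subdomain_by_slicing and the domain_labels parameter are assumptions added for illustration; the default of 2 mirrors the [:-2] slice and only fits hosts whose registered domain has exactly two labels.

```python
from urllib.parse import urlparse


def subdomain_by_slicing(url, domain_labels=2):
    """Join every host label except the last `domain_labels` ones."""
    labels = urlparse(url).hostname.split(".")
    return ".".join(labels[:-domain_labels])


print(subdomain_by_slicing('http://www.my.cwi.nl:80/%7Eguido/Python.html'))
# With a two-part suffix the default mis-splits; the caller has to know better.
print(subdomain_by_slicing('http://forums.bbc.co.uk'))
print(subdomain_by_slicing('http://forums.bbc.co.uk', domain_labels=3))
```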

2 Comments

This assumes that the main domain name has two parts, which will fail in certain cases, e.g. .co.uk addresses. Besides the UK, Israel, Brazil and Japan all have formal second-level domains, and there are probably others.
My answer deals with this problem using a list of valid TLDs.
1

First, import tldextract, which splits a URL into its constituents: subdomain, domain, and suffix.

import tldextract

Then declare a variable (say ext) that stores the result of the query. Pass the URL as a quoted string inside the parentheses, as shown below:

ext = tldextract.extract("http://lol1.domain.com:8888/some/page")

If we simply evaluate ext, the output will be:

ExtractResult(subdomain='lol1', domain='domain', suffix='com')

If you want only the subdomain, domain, or suffix, use the corresponding attribute below.

ext.subdomain

The result will be:

'lol1'

ext.domain

The result will be:

'domain'

ext.suffix

The result will be:

'com'

Also, if you want to store only the subdomain in a variable, use the code below:

Sub_Domain = ext.subdomain

Then evaluate Sub_Domain:

Sub_Domain

The result will be:

'lol1'

1

import re

def extract_domain(domain):
    # Strip the scheme, then everything from the first ':' or '/' onward.
    domain = re.sub(r'https?://|[:/].*', '', domain)
    matches = re.findall(r"([a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$", domain)
    if matches:
        return matches[0]
    else:
        return domain

def extract_subdomains(domain):
    subdomains = re.sub(r'https?://|[:/].*', '', domain)
    domain = extract_domain(subdomains)
    # Escape the domain so its dots match literally, and anchor it at the end.
    subdomains = re.sub(r'\.?' + re.escape(domain) + '$', '', subdomains)
    return subdomains

Example to fetch subdomains:

print(extract_subdomains('http://lol1.domain.com:8888/some/page'))
print(extract_subdomains('kota-tangerang.kpu.go.id'))

Outputs:

lol1
kota-tangerang

Example to fetch the domain:

print(extract_domain('http://lol1.domain.com:8888/some/page'))
print(extract_domain('kota-tangerang.kpu.go.id'))

Outputs:

domain.com
kpu.go.id

1

Standardize all domains to start with www. unless they have a subdomain.

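from urllib.parse import urlparse

def has_subdomain(hostname):
    # More than two dot-separated labels means something precedes "domain.tld".
    return len(hostname.split('.')) > 2

url = 'http://domain.com/some/page'  # example input, not from the original answer
parsed = urlparse(url)
domain = parsed.netloc

if not has_subdomain(parsed.hostname):
    domain_name = 'www.' + domain
    url = parsed.scheme + '://' + domain_name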
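If the path, query, and port should survive the rewrite, one possible sketch is to replace only the netloc of the parsed result. The helper name ensure_www is hypothetical, and the code assumes the URL carries no user:password@ prefix and that a two-label host is the only case that needs a www. prefix.

```python
from urllib.parse import urlparse


def ensure_www(url):
    """Prefix a two-label host with 'www.' and keep the rest of the URL."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    # Assumption: anything other than exactly two labels is left untouched.
    if len(hostname.split(".")) != 2:
        return url
    # Assumption: netloc has no userinfo, so prefixing it prefixes the host.
    return parsed._replace(netloc="www." + parsed.netloc).geturl()


print(ensure_www('http://domain.com:8888/some/page'))
print(ensure_www('http://lol1.domain.com:8888/some/page'))
```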

0

For extracting the hostname, I'd use urlparse from urllib.parse:

>>> from urllib.parse import urlparse
>>> a = "http://lol1.domain.com:8888/some/page"
>>> urlparse(a).hostname
'lol1.domain.com'

As for extracting the subdomain, you need to cover the case where the FQDN is longer. How you do this depends on your purposes. I might suggest stripping off the two rightmost components.

E.g.

>>> urlparse(a).hostname.rpartition('.')[0].rpartition('.')[0]
'lol1'

0

We can use https://github.com/john-kurkowski/tldextract for this problem...

It's easy.

>>> import tldextract
>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')

0

tldextract separates the TLD from the registered domain and subdomains of a URL.

Installation

pip install tldextract

For the current question:

import tldextract

address = 'http://lol1.domain.com:8888/some/page'
domain = tldextract.extract(address).domain
print("Extracted domain name : ", domain)

The output:

Extracted domain name :  domain

In addition, there are a number of examples closely related to the usage of tldextract.extract.

0

Using Python 3 (I'm using 3.9 to be specific), you can do the following:

from urllib.parse import urlparse

address = 'http://lol1.domain.com:8888/some/page'

url = urlparse(address)

url.hostname.split('.')[0]
