Extract domain using regular expression

Question

Suppose I have these URLs:

http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu

I just want to extract "domainname.com", "domainname.org" or "domainname.edu". How can I do this?

I think I need to find the last dot just before "com|org|edu...", then print everything from the dot before it up to the next dot after it (if there is one).

I need help with the regular expression. Thanks a lot! I am using Python.

Call me simpleminded, but I wanted something similar (for a cookie domain) and came up with this: print('.%s' % '.'.join("host.dom.com".split('.')[-2:])) gives .dom.com (in all cases I tried).
– MarkHu, Commented Mar 1, 2014 at 0:16

Jase · Accepted Answer · 2011-03-30 21:47:21Z

12

Why use regex?

http://docs.python.org/library/urlparse.html
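
As a minimal sketch of what this suggestion might look like: the linked page documents the Python 2 `urlparse` module, which lives at `urllib.parse` in Python 3. The helper name `last_two_labels` is made up for illustration, and keeping only the last two labels is an assumption that fits the URLs in the question but not suffixes such as `co.uk`.

```python
from urllib.parse import urlparse

def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's host name."""
    # .hostname excludes any port; it is None when the URL has no authority.
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])

for url in [
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://mail.domainname.org/abc/abc/aaa",
    "http://domainname.edu",
]:
    print(last_two_labels(url))
```

For the three URLs in the question this prints `domainname.com`, `domainname.org` and `domainname.edu`.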


Community · Accepted Answer · 2021-10-07 05:57:16Z

If you would like to go the regex route...

RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:

re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme    = $2
# authority = $4
# path      = $5
# query     = $7
# fragment  = $9
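
To make the group numbering concrete, here is a small sketch that applies the Appendix B pattern with Python's `re` module. The function name `split_uri` and the sample URL are illustrative assumptions, not part of the RFC.

```python
import re

# The RFC 3986 Appendix B pattern, unchanged.
RE_3986 = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Map the numbered groups to the component names listed above."""
    m = RE_3986.match(uri)  # every part is optional, so any string matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

parts = split_uri("http://mail.domainname.org/abc/abc/aaa?x=1#top")
print(parts["authority"])  # mail.domainname.org
```

Components that are absent from the input come back as `None`, except the path, which is an empty string.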

Here is an enhanced, Python friendly version which utilizes named capture groups. It is presented in a function within a working script:

import re

def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                    # anchor to beginning of string
        (?:  (?P<scheme>    [^:/?#\s]+): )?  # capture optional scheme
        (?://(?P<authority>  [^/?#\s]*)  )?  # capture optional authority
             (?P<path>        [^?#\s]*)      # capture required path
        (?:\?(?P<query>        [^#\s]*)  )?  # capture optional query
        (?:\#(?P<fragment>      [^\s]*)  )?  # capture optional fragment
        $                                    # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain =  re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})  # $domain: top two domain levels.
        (?::[0-9]*)?                      # Optional port number.
        $                                 # Anchor to end of string.
        """, 
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result

data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
    ]
for cnt, data in enumerate(data_list, start=1):
    print("Data[%d] domain = \"%s\"" %
        (cnt, get_domain(data)))

For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation

Mathias Nielsen · Accepted Answer · 2011-03-30 21:53:18Z

1

In addition to Jase's answer: if you don't want to use urlparse, just split the URLs.

Strip off the protocol (http:// or https://), then split the string at the first occurrence of '/'. For the second URL this leaves you with 'mail.domainname.org'. Split that by '.' and select the last two items from the list with [-2:].

This will yield domainname.org or the equivalent, provided you strip the protocol correctly and the URLs are valid.

I would just use urlparse, but it can be done. I don't know about the regex, but this is how I would do it.
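
The splitting steps described in this answer could be sketched roughly as follows. The function name `domain_by_splitting` is hypothetical, and the sketch assumes the host is followed by either '/' or the end of the string, so a port or a query attached directly to the host is not removed.

```python
def domain_by_splitting(url):
    """Return the last two host labels using plain string splits."""
    # Drop the protocol, if there is one.
    rest = url.split("://", 1)[-1]
    # Keep everything before the first '/'.
    host = rest.split("/", 1)[0]
    # Join the last two dot-separated labels.
    return ".".join(host.split(".")[-2:])

print(domain_by_splitting("http://mail.domainname.org/abc/abc/aaa"))
```

This prints `domainname.org`. An input such as `http://domainname.com:80` would keep the `:80`, which is one reason urlparse is the safer choice.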


undefined · Accepted Answer · 2011-03-30 22:27:51Z

Should you need more flexibility than urlparse provides, here's an example to get you started:

import re
def getDomain(url):
    #requires 'http://' or 'https://'
    #pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    #'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False

I used the named group (?P<domain>\w+) to grab the match, which is then indexed by its name, m.group('domain'). The great thing about learning regular expressions is that once you are comfortable with them, solving even the most complicated parsing problems is relatively simple. This pattern could be improved to be more or less forgiving if necessary -- this one for example will return '678' if you pass it 'http://123.45.678.90', but should work great on just about any other URL you can come up with. Regexr is a great resource for learning and testing regexes.
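
To see where the pattern is forgiving and where it is not, a quick check such as the one below may help. It reuses the pattern from the answer unchanged; the helper name `domain_or_none` and the hyphenated sample host are illustrative assumptions.

```python
import re

# Same pattern as in getDomain above.
PAT = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'

def domain_or_none(url):
    """Return the 'domain' group, or None when the pattern does not match."""
    m = re.match(PAT, url)
    return m.group('domain') if m else None

for url in [
    "http://mail.domainname.org/abc/abc/aaa",  # ordinary host name
    "domainname.edu",                          # scheme omitted
    "http://123.45.678.90",                    # dotted numbers look like a host
    "http://my-site.com",                      # \w does not match '-'
]:
    print(url, "->", domain_or_none(url))
```

The first two inputs give `domainname` (the label without its suffix), the dotted-number input gives `678` as noted above, and the hyphenated host gives `None` because `\w` excludes hyphens.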
