2

How do I use a regex to remove http:// and/or www. so that http://www.domain.com/ becomes domain.com?

Assume x is any kind of TLD or ccTLD.

Input example:

http://www.domain.x/

www.domain.x

Output:

domain.x

2
  • str.lstrip([chars]) Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped: >>> ' spacious '.lstrip() 'spacious ' >>> 'www.example.com'.lstrip('cmowz.') 'example.com' Commented Jun 28, 2012 at 10:09
  • It is worth mentioning that there are also www-pub, www-groups, www2, www3 and other www-like prefixes. Commented Mar 29, 2013 at 11:39
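To illustrate the lstrip() point from the first comment: because the argument is a set of characters, it can remove more than a literal "www." prefix. A minimal sketch follows; strip_prefix is a hypothetical helper name, not something from the thread.

```python
def strip_prefix(text, prefix="www."):
    """Remove `prefix` only when `text` actually starts with it."""
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

# lstrip() treats its argument as a set of characters, so it can eat too much:
over_stripped = "www.wow.com".lstrip("w.")   # strips every leading 'w' and '.'
exact = strip_prefix("www.wow.com")          # removes only the literal "www."
```

On Python 3.9 and later, str.removeprefix("www.") does the same job as the helper.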

3 Answers

7

Don't use a regex; use urlparse to get the netloc:

>>> x = 'http://www.domain.com/'
>>> from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
>>> o = urlparse(x)
>>> o
ParseResult(scheme='http', netloc='www.domain.com', path='/', params='', query='', fragment='')
>>> 

and then

>>> o.netloc
'www.domain.com'
>>> if o.netloc.startswith('www.'): print(o.netloc[4:])
... 
domain.com
>>> 
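One gap worth noting: urlparse only fills in netloc when the input has a scheme or a leading "//", so the question's second input, www.domain.x, would end up in path instead. A minimal sketch that covers both inputs, assuming Python 3; bare_host is a hypothetical helper name:

```python
from urllib.parse import urlparse

def bare_host(url):
    """Return the host of `url` without a leading 'www.' (hypothetical helper)."""
    # urlparse only fills netloc when a scheme or leading '//' is present,
    # so prepend '//' for inputs such as 'www.domain.x'.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    host = urlparse(url).netloc
    if host.startswith("www."):
        host = host[4:]
    return host
```

Note that netloc still includes any port or user info; this sketch does not strip those.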

3 Comments

o.netloc.startswith('www.') would be more appropriate than 'www' in o.netloc
@Janne Karila: Thanks, Janne. I lost that completely while answering quickly. That's of course the correct way, not the one I presented, which was in fact incorrect.
In Python 3.5: from urllib.parse import urlparse
4

If you really want to use regular expressions instead of urlparse() or splitting the string:

>>> import re
>>> domain = 'http://www.example.com/'
>>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]
'example.com'

The regular expression might be a bit simplistic, but it works. It also extracts the domain rather than replacing the prefix, which I think is easier.

To support domains like 'co.uk', one can do the following:

>>> domain = 'http://www.google.co.uk/'
>>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
>>> p.match(domain).groups()
('google', 'co.uk')

So you have to check the result for suffixes like 'co.uk' and join the two groups again in such a case. Normal domains should work fine. I could not make it work when there are multiple subdomains.
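The check-and-join step described above can be sketched without a regex by comparing the last two labels of the host against a set of known two-part suffixes. The set below is a hypothetical, deliberately tiny sample; a complete solution would need data such as the Public Suffix List. The function takes a host name, not a full URL, and registered_domain is an illustrative name.

```python
# Hypothetical, deliberately tiny suffix list; real data would come from
# the Public Suffix List.
MULTI_PART_SUFFIXES = {"co.uk", "org.uk", "com.au", "co.jp"}

def registered_domain(host):
    """Return the last two labels of `host`, or three for a listed suffix."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_PART_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])
```

Because it works on labels from the right, this approach also copes with multiple subdomains, for example a.b.example.com.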

One-liner without regular expressions or fancy modules:

>>> domain = 'http://www.example.com/'
>>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
'example.com'

3 Comments

I managed to paste the wrong regex in my initial post, but it's now edited with the correct one.
@Natsume Made me think, and I've updated the regex so 'http://' is optional and it accepts any protocol, like 'https://' or 'bzr://'.
@Wooble I wouldn't say horribly, since it returns 'co.uk', but I understand the problem. I'm adding a solution for this.
1

Here is one way to do it:

    >>> import re
    >>> str1 = 'http://www.domain.x/'
    >>> p1 = re.compile(r'http://www\.|/')  # escape the dot so it matches only '.'
    >>> out = p1.sub('', str1)
    >>> out
    'domain.x'

2 Comments

Nice, but it does not cover where 'www' would be missing from the URL.
one can use match from re as below to check if required substring 'www' exists or not : >>> print p1.match("www")
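To cover the case raised in the first comment, where 'www' (or the scheme) is missing, both parts of the pattern can be made optional. This is a sketch under that assumption rather than the answerer's pattern; PREFIX and strip_scheme_and_www are hypothetical names.

```python
import re

# Optional scheme ("http://", "https://", ...), then an optional "www." prefix.
PREFIX = re.compile(r'^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?', re.IGNORECASE)

def strip_scheme_and_www(url):
    """Drop the scheme and a leading 'www.', then cut at the first '/'."""
    return PREFIX.sub('', url).split('/', 1)[0]
```

Anchoring with ^ means only a prefix at the very start is removed, so a 'www.' that appears later in the host is left alone.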
