Strip URL - Python

Question

How do I use a regex to remove http:// and/or www. so that http://www.domain.com/ becomes domain.com?

Assume x is any kind of TLD or ccTLD.

Input example:

http://www.domain.x/

www.domain.x

Output:

domain.x

str.lstrip([chars]): Return a copy of the string with leading characters removed. The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace. The chars argument is not a prefix; rather, all combinations of its values are stripped:

>>> ' spacious '.lstrip()
'spacious '
>>> 'www.example.com'.lstrip('cmowz.')
'example.com'

– doniyor, Commented Jun 28, 2012 at 10:09
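Because chars is a set rather than a prefix, lstrip can eat more than the 'www.' you meant to remove. A minimal sketch of the pitfall, assuming a made-up hostname that happens to start with 'w' after the prefix:

```python
# str.lstrip treats its argument as a set of characters, not as a prefix.
host = 'www.web.example'  # hypothetical hostname, chosen to show the pitfall

# Strips every leading 'w' and '.', including the 'w' that begins 'web'.
stripped_by_set = host.lstrip('w.')

# An explicit prefix check removes exactly 'www.' and nothing else.
prefix = 'www.'
stripped_by_prefix = host[len(prefix):] if host.startswith(prefix) else host

print(stripped_by_set)     # eb.example
print(stripped_by_prefix)  # web.example
```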
It is worth mentioning that there are also www-pub, www-groups, www2, www3 and other www-like prefixes.
– Unicorn, Commented Mar 29, 2013 at 11:39
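If those www-like labels should be stripped too, one option is a slightly wider pattern. This is only a sketch: the definition of "www-like" used here ('www', optional digits, optional hyphenated words) is an assumption, and strip_www_like is a hypothetical helper name.

```python
import re

# Assumption: a "www-like" label is 'www', then optional digits, then
# optional '-word' parts (www2, www-pub, www-groups, ...).
WWW_LIKE = re.compile(r'^www[0-9]*(?:-[a-z0-9]+)*\.', re.IGNORECASE)

def strip_www_like(host):
    """Remove one leading www-like label from a hostname, if present."""
    return WWW_LIKE.sub('', host, count=1)
```

Labels that merely begin with the letters 'www', such as 'wwwidgets', are left alone because the pattern requires the dot right after the www-like part.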

pyfunc · Accepted Answer · 2012-06-28 17:10:27Z

7

Don't use a regex; use urlparse to get the netloc:

>>> x = 'http://www.domain.com/'
>>> from urlparse import urlparse
>>> o = urlparse(x)
>>> o
ParseResult(scheme='http', netloc='www.domain.com', path='/', params='', query='', fragment='')
>>>

and then

>>> o.netloc
'www.domain.com'
>>> if o.netloc.startswith('www.'): print o.netloc[4:]
... 
domain.com
>>>
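For Python 3, and for the question's second input (www.domain.x, which has no scheme), the same idea might look like the sketch below. bare_domain is a hypothetical helper name. The '//' handling is needed because urlparse only fills netloc when the input has a scheme or starts with '//'.

```python
from urllib.parse import urlparse

def bare_domain(url):
    """Return the netloc of `url` without a leading 'www.' (hypothetical helper)."""
    # Without a scheme or leading '//', urlparse puts the host into .path,
    # so schemeless inputs such as 'www.domain.x' get '//' prepended first.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    host = urlparse(url).netloc
    return host[4:] if host.startswith('www.') else host
```

Note that netloc still includes any port or user info. If those may appear in your input, the .hostname attribute of the parse result is probably the better starting point.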

edited Jun 28, 2012 at 17:10

answered Jun 28, 2012 at 10:10

pyfunc


3 Comments

Janne Karila Over a year ago

o.netloc.startswith('www.') would be more appropriate than 'www' in o.netloc

pyfunc Over a year ago

@Janne Karila: Thanks, Janne. I lost that completely while answering quickly. That is of course the correct way, not the one I presented, which was in fact incorrect.

firephil Over a year ago

In Python 3.5: from urllib.parse import urlparse

geertjanvdk · Accepted Answer · 2012-06-28 12:59:24Z

4

If you really want to use regular expressions instead of urlparse() or splitting the string:

>>> import re
>>> domain = 'http://www.example.com/'
>>> re.match(r'(?:\w*://)?(?:.*\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*', domain).groups()[0]
'example.com'

The regular expression might be a bit simplistic, but it works. It also extracts the domain rather than replacing the prefix, which I think is easier.

To support domains like 'co.uk', one can do the following:

>>> domain = 'http://www.google.co.uk/'  # assumed input; the original omits this line
>>> p = re.compile(r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*')
>>> p.match(domain).groups()
('google', 'co.uk')

So you have to check the result for suffixes like 'co.uk' and join the groups again in that case. Normal domains should work fine. I could not make it work for hosts with multiple subdomains.
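The check-and-join step described above could be sketched roughly as follows. The suffix set is a tiny hand-written sample, not a real list of public suffixes. registered_domain is a hypothetical name, and the character class is written as letters, digits and hyphen, which is an assumption about what the pattern intends.

```python
import re

# Same shape as the two-group pattern, with the character class spelled
# out as letters, digits and hyphen (an assumption about the intent).
PATTERN = re.compile(
    r'(?:\w*://)?(?:.*?\.)?(?:([a-zA-Z0-9-]*)\.)?([a-zA-Z0-9-]*\.[a-zA-Z]{1,}).*'
)

# Assumption: a small illustrative sample of two-part suffixes only.
TWO_PART_SUFFIXES = {'co.uk', 'org.uk', 'com.au'}

def registered_domain(url):
    """Join the two groups again when the tail is a known two-part suffix."""
    match = PATTERN.match(url)
    if match is None:
        return None
    name, tail = match.groups()
    if name is not None and tail in TWO_PART_SUFFIXES:
        return name + '.' + tail
    return tail
```

Any suffix missing from the sample set falls through to the plain two-label result, so this inherits the limitation with multiple subdomains mentioned above.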

One-liner without regular expressions or fancy modules:

>>> domain = 'http://www.example.com/'
>>> '.'.join(domain.replace('http://','').split('/')[0].split('.')[-2:])
'example.com'
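The one-liner only strips 'http://' and always keeps the last two labels, so a host under 'co.uk' would come back as just 'co.uk'. A variation that stays with plain string methods but removes a literal 'www.' instead might look like this (host_without_www is a hypothetical helper name):

```python
def host_without_www(url):
    """Hypothetical helper: plain string operations, no regex or urlparse."""
    # Drop any 'scheme://' prefix, then keep only the part before the first '/'.
    host = url.split('://', 1)[-1].split('/', 1)[0]
    # Remove a literal leading 'www.' rather than keeping the last two labels.
    return host[4:] if host.startswith('www.') else host
```

This keeps every other subdomain intact, which may or may not be what you want.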

edited Jun 28, 2012 at 12:59

answered Jun 28, 2012 at 10:28

geertjanvdk

3 Comments

geertjanvdk Over a year ago

I managed to paste the wrong regex in my initial post, but it's now edited with the correct one.

geertjanvdk Over a year ago

@Natsume Made me think, and I've updated the regex so 'http://' is optional and it accept any protocol, like 'https://' or 'bzr://'.

geertjanvdk Over a year ago

@Wooble I wouldn't say horribly, since it returns 'co.uk', but I understand the problem. I'm adding a solution for this.

Thiem Nguyen · Accepted Answer · 2012-06-28 17:36:28Z

1

Here is one way to do it:

    >>> import re
    >>> str1 = 'http://www.domain.x/'
    >>> p1 = re.compile(r'http://www\.|/')
    >>> out = p1.sub('', str1)
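To also handle inputs where the scheme or the 'www.' is missing, both parts can be made optional and anchored at the start of the string. A rough sketch, assuming you want to keep any path and only trim trailing slashes (strip_prefix is a hypothetical name):

```python
import re

# Optional 'scheme://', then optional 'www.', anchored at the start.
PREFIX = re.compile(r'^(?:[a-z][a-z0-9+.-]*://)?(?:www\.)?', re.IGNORECASE)

def strip_prefix(url):
    """Hypothetical helper: drop the scheme and 'www.', then trailing '/'."""
    return PREFIX.sub('', url, count=1).rstrip('/')
```

Unlike the alternation with '/', this version leaves slashes inside a path alone, so 'http://www.domain.x/a/b' would come back as 'domain.x/a/b'.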

edited Jun 28, 2012 at 17:36

Thiem Nguyen

answered Jun 28, 2012 at 10:26

user1242393

2 Comments

geertjanvdk Over a year ago

Nice, but it does not cover where 'www' would be missing from the URL.

user1242393 Over a year ago

One can use match from re, as below, to check whether the required substring 'www' exists: >>> print p1.match("www")
