python regex urls

Question

I have a bunch of (ugly, if I may say) URLs that I would like to clean up using Python regex. My URLs look something like:

http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....

I'd like to clean up these URLs so that the final links look like:

http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.com.uk
http://www.thisislink1.co.in
....

and I was wondering how I can achieve this in a Pythonic way. Sorry if this is a 101 question - I am new to Python regex structures.

unutbu · Accepted Answer · 2012-12-07 12:46:00Z

7

Use urlparse.urlsplit:

In [3]: import urlparse    

In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')

In [9]: url.netloc
Out[9]: 'www.thisislink1.com'

In Python 3 it would be:

import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
url.netloc  # 'www.thisislink1.com'
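The question asks for the scheme together with the host, which netloc alone does not include. A minimal Python 3 sketch of one way to join the two; the helper name base_url is made up for illustration and is not part of urllib:

```python
from urllib.parse import urlsplit


def base_url(url):
    """Return only the scheme and host of url (hypothetical helper)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


# Sample URLs taken from the question.
urls = [
    "http://www.thisislink1.com/this/is/sublink1/1",
    "http://www.thisislink2.co.uk/this/is/sublink1s/klinks",
    "http://www.thisislinkd.co/this/is/sublink1/hotlinks/2",
]

cleaned = [base_url(u) for u in urls]
print(cleaned)
```

This assumes every input carries a scheme such as http://; without one, urlsplit puts the host into path and netloc comes back empty.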

edited Dec 7, 2012 at 12:46

answered Dec 7, 2012 at 12:40


Jon Clements · Accepted Answer · 2012-12-07 12:41:47Z

6

Why use regex?

>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')
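Since SplitResult is a named tuple, another possible route is to blank out the unwanted fields and let the result reassemble itself. A sketch under the assumption of Python 3, where the module is urllib.parse rather than urlparse; the variable names are illustrative:

```python
from urllib.parse import urlsplit

url = "http://www.thisislinkd.co/this/is/sublink1/hotlinks/2"
parts = urlsplit(url)

# _replace returns a copy of the named tuple with the given fields changed;
# geturl() then rebuilds a URL string from the remaining components.
root = parts._replace(path="", query="", fragment="").geturl()
print(root)
```

Compared with formatting scheme and netloc by hand, this leaves the reassembly rules to the standard library.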

answered Dec 7, 2012 at 12:41

3 Comments

AJW Over a year ago

That is awesome. I did not know about urlparse - very handy I must say. Thanks again. I have accepted your answer. urlparse.urlsplit(url).netloc solved the problem.

Jon Clements Over a year ago

just seen that @unutbu got there first (by a few seconds), go with theirs!

AJW Over a year ago

ok Jon - I will accept unutbu's answer - Thanks again for your help tho!

Chris Seymour · Accepted Answer · 2012-12-07 12:51:19Z

1

You should use a URL parser, as others have suggested, but for completeness here is a solution with regex:

import re

url = 'http://www.thisislink1.com/this/is/sublink1/1'

# Remove the first '/' that is not part of '://' and everything after it
re.sub(r'(?<![/:])/.*', '', url)
# 'http://www.thisislink1.com'

Explanation:

Match the first forward slash that is not preceded by a ':' or '/', together with everything after it, and replace the match with the empty string ''.

(?<![/:]) # Negative lookbehind for '/' or ':'
/.*       # Match a / followed by anything
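The commented pattern above can be compiled as written by using re.VERBOSE, which ignores whitespace and # comments inside the pattern. A small sketch applying it to a few of the URLs from the question; the name PATH_RE is an illustrative choice:

```python
import re

# Verbose form of the same pattern, with the explanation kept inline.
PATH_RE = re.compile(
    r"""
    (?<![/:])   # negative lookbehind: previous character is not '/' or ':'
    /.*         # a '/' followed by the rest of the string
    """,
    re.VERBOSE,
)

urls = [
    "http://www.thisislink1.com/this/is/sublink1/1",
    "http://www.thisislink2.co.uk/this/is/sublink1s/klinks",
    "http://www.thisislinkd.co",  # no path: nothing to strip
]

cleaned = [PATH_RE.sub("", u) for u in urls]
print(cleaned)
```

A URL that has no path is left unchanged, because both slashes of '://' are excluded by the lookbehind.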

edited Dec 7, 2012 at 12:51

answered Dec 7, 2012 at 12:45


Andreas · Accepted Answer · 2012-12-07 12:45:41Z

0

Maybe use something like this, where subject is a string holding one URL per line:

import re
result = re.sub(r"(?m)^(https?://[^/\s]+)/.*$", r"\1", subject)

answered Dec 7, 2012 at 12:45
