I have a bunch of (ugly, if I may say so) URLs that I would like to clean up using Python regex. My URLs look something like this:

http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....

I'd like to clean up these URLs so that the final links look like this:

http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.com.uk
http://www.thisislink1.co.in
....

I was wondering how I can achieve this in a Pythonic way. Sorry if this is a 101 question; I am new to Python regex.

4 Answers

7

Use urlparse.urlsplit:

In [3]: import urlparse    

In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')

In [9]: url.netloc
Out[9]: 'www.thisislink1.com'

In Python 3 it would be:

import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
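Note that netloc alone drops the http:// prefix that the question wants to keep. A minimal sketch of putting the scheme back, assuming Python 3; the helper name base_url is made up for illustration and is not part of the standard library:

```python
from urllib.parse import urlsplit

def base_url(url):
    """Return scheme://host for url, dropping path, query and fragment."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

# Sample URLs taken from the question.
urls = [
    "http://www.thisislink1.com/this/is/sublink1/1",
    "http://www.thisislink2.co.uk/this/is/sublink1s/klinks",
    "http://www.thisislinkd.co/this/is/sublink1/hotlinks/2",
]

cleaned = [base_url(u) for u in urls]
for c in cleaned:
    print(c)
```

This keeps whatever scheme the input had, so https:// links would stay https://.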

6

Why use regex?

>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')
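Since SplitResult is a named tuple, one way to get the cleaned URL straight from it is to blank the unwanted fields and reassemble. A small sketch, assuming Python 3 (where the module is urllib.parse rather than urlparse):

```python
from urllib.parse import urlsplit

url = "http://www.thisislinkd.co/this/is/sublink1/hotlinks/2"
parts = urlsplit(url)

# _replace comes from namedtuple; geturl() reassembles the remaining parts.
stripped = parts._replace(path="", query="", fragment="").geturl()
print(stripped)
```

Here only scheme and netloc survive, so the result is the bare site address.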

3 Comments

That is awesome. I did not know about urlparse - very handy I must say. Thanks again. I have accepted your answer. urlparse.urlsplit(url).netloc solved the problem.
just seen that @unutbu got there first (by a few seconds), go with theirs!
OK Jon, I will accept unutbu's answer. Thanks again for your help, though!

1

You should use a URL parser, as others have suggested, but for completeness here is a solution with regex:

import re

url = 'http://www.thisislink1.com/this/is/sublink1/1'

re.sub(r'(?<![/:])/.*', '', url)
# 'http://www.thisislink1.com'

Explanation:

Match the first forward slash that is not preceded by a : or /, together with everything after it, and replace the match with the empty string ''.

(?<![/:]) # Negative lookbehind for '/' or ':'
/.*       # Match a / followed by anything
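To check the lookbehind against the sample URLs from the question, here is a rough sketch; the names TRAILING_PATH and strip_path are only for illustration:

```python
import re

# First '/' not preceded by '/' or ':', plus everything after it.
TRAILING_PATH = re.compile(r"(?<![/:])/.*")

def strip_path(url):
    """Remove the path (and anything following it) from url."""
    return TRAILING_PATH.sub("", url)

samples = [
    "http://www.thisislink2.co.uk/this/is/sublink1s/klinks",
    "http://www.thisislinkf.com.uk/this/is/sublink1d/morelink",
    "http://www.thisislink1.com",  # already clean, should be left alone
]

results = [strip_path(u) for u in samples]
for r in results:
    print(r)
```

The two slashes in http:// are skipped because one follows a : and the other follows a /, so a URL without a path is returned unchanged.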

0

Maybe use something like this:

# subject holds one URL per line; keep only scheme://host on each line
result = re.sub(r"(?m)^(https?://[^/\s]+)/.*$", r"\1", subject)
