3

I have this URL:

http://www.exmaple.com/boo/a.php?a=jsd

and the output I want is this:

http://www.exmaple.com/boo/

Likewise, if I have

http://www.exmaple.com/abc.html

it should be

http://www.exmaple.com/

and

http://www.exmaple.com/

should return

http://www.exmaple.com/

without any change

This is what I have tried:

re.sub(r'\?[\S]+','',"http://www.exmaple.com/boo/a.php?a=jsd")

but it returns

http://www.exmaple.com/boo/a.php

Any suggestions on how to get the correct output, or does anyone have a better idea for doing this?

3
  • 3
    Is the urlparse module not good enough? Commented Jan 8, 2013 at 14:24
  • 1
    Is it 'exmaple' everywhere on purpose? Commented Jan 8, 2013 at 14:32
  • @MartijnPieters Yes, urlparse was what I needed! Thanks. Commented Jan 8, 2013 at 14:39

3 Answers

5

Please use the standard library's urlparse module, like this. I generally try to avoid regex unless it is absolutely necessary.

>>> from urlparse import urlparse, urlunparse
>>> parsed = urlparse("http://www.exmaple.com/boo/a.php?a=jsd")
>>> scheme, netloc, path, params, query, fragment = parsed
>>> urlunparse((scheme,netloc,path.split('/')[1],'','',''))
'http://www.exmaple.com/boo'

1 Comment

Yes, however, the path.split part requires some tune-up (check out http://www.exmaple.com/).
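One possible way to do the tune-up mentioned in this comment, as a minimal sketch: parse the URL, keep the path only up to its last "/", and rebuild it without the query and fragment. This assumes Python 3, where the urlparse functions live in urllib.parse. The name strip_to_directory is made up for illustration and is not from the answer.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_to_directory(url):
    """Drop the query, the fragment, and the last path segment (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep the path up to and including its last "/"; an empty path becomes "/".
    directory = parts.path[:parts.path.rfind("/") + 1] or "/"
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))

print(strip_to_directory("http://www.exmaple.com/boo/a.php?a=jsd"))  # http://www.exmaple.com/boo/
print(strip_to_directory("http://www.exmaple.com/abc.html"))         # http://www.exmaple.com/
print(strip_to_directory("http://www.exmaple.com/"))                 # http://www.exmaple.com/
```

Note that a URL with no path at all, such as http://www.exmaple.com, comes back with a trailing slash in this sketch; whether that is acceptable depends on what you need.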
1

I would do something like this:

>>> url = "http://www.exmaple.com/boo/a.php?a=jsd"
>>> url[:url.rfind("/")+1]
'http://www.exmaple.com/boo/'

This removes everything after the last "/". I am not sure it covers all special cases, though.

EDIT: New solution using urlparse and my simple rfind:

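import urlparse
def url_cutter(url):
    up = urlparse.urlparse(url)
    # Rebuild scheme://netloc/path, dropping the query and fragment
    url2 = up[0] + "://" + up[1] + up[2]
    # Skip the cut when the last "/" is the one in "http://" (index 6)
    if url.rfind("/") > 6:
        url2 = url2[:url2.rfind("/") + 1]
    return url2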

Then:

In [36]: url_cutter("http://www.exmaple.com/boo/a.php?a=jsd")
Out[36]: 'http://www.exmaple.com/boo/'

In [37]: url_cutter("http://www.exmaple.com/boo/a.php?a=jsd#dvt_on")
Out[37]: 'http://www.exmaple.com/boo/'

In [38]: url_cutter("http://www.exmaple.com")
Out[38]: 'http://www.exmaple.com'

2 Comments

Indeed @MevinBabu, one can add a simple test like if url.rfind("/")>6 to avoid this case.
It fails if the url has #fragment/with/slashes
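To illustrate the fragment problem raised in this comment, here is a small sketch comparing the plain rfind slice on the raw string with the same slice applied only to the parsed path. It assumes Python 3 (urllib.parse instead of urlparse), and the example URL is invented for the demonstration.

```python
from urllib.parse import urlsplit

url = "http://www.exmaple.com/boo/a.php?a=jsd#fragment/with/slashes"

# Slicing the raw string cuts at the last "/" inside the fragment.
naive = url[:url.rfind("/") + 1]
print(naive)  # http://www.exmaple.com/boo/a.php?a=jsd#fragment/with/

# Slicing only the parsed path ignores the query and the fragment.
parts = urlsplit(url)
safe = parts.scheme + "://" + parts.netloc + parts.path[:parts.path.rfind("/") + 1]
print(safe)  # http://www.exmaple.com/boo/
```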
0

There might be a more optimized way to do it, but with this one you won't need an obscure import or a third-party package.

url = "http://www.google.com/abc/abc.html?q=test"
cleaned_url = url[:url.rindex("?")]
cleaned_url = cleaned_url.split("/")
cleaned_url = [item for item in cleaned_url if ".html" not in item]
cleaned_url = "/".join(cleaned_url)

1 Comment

You might want to check whether url.rindex raises an error when "?" is not present in the string.
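For what it's worth, str.rindex does raise ValueError when the substring is missing. A minimal sketch of one way around that, assuming you only want to drop the query string, is str.partition, which returns the whole string unchanged when "?" is absent. The helper name strip_query is hypothetical.

```python
def strip_query(url):
    """Return the URL without its query string (hypothetical helper)."""
    # partition() returns the whole string as the first element when "?" is absent.
    return url.partition("?")[0]

print(strip_query("http://www.google.com/abc/abc.html?q=test"))  # http://www.google.com/abc/abc.html
print(strip_query("http://www.google.com/abc/abc.html"))         # http://www.google.com/abc/abc.html

try:
    "http://www.google.com/abc/abc.html".rindex("?")
except ValueError:
    print("rindex raises ValueError when '?' is missing")
```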
