Remove GET variables from URL in Python

Question

I have this URL:

http://www.exmaple.com/boo/a.php?a=jsd

and the output I want is something like this:

http://www.exmaple.com/boo/

Likewise, if I have

http://www.exmaple.com/abc.html

it should be

http://www.exmaple.com/

and

http://www.exmaple.com/

should return

http://www.exmaple.com/

without any change

This is what I have tried:

re.sub(r'\?[\S]+','',"http://www.exmaple.com/boo/a.php?a=jsd")

but it returns

http://www.exmaple.com/boo/a.php

Any suggestions on how to get the correct output, or does anyone have a better way to do this?

Is the urlparse module not good enough?

– Martijn Pieters, Commented Jan 8, 2013 at 14:24
Is it 'exmaple' everywhere on purpose?

– Dhara, Commented Jan 8, 2013 at 14:32
@MartijnPieters Yes, urlparse was what I needed! Thanks.

– Mevin Babu, Commented Jan 8, 2013 at 14:39

Fredrick Brennan · Accepted Answer · 2013-01-08 14:43:43Z

5

Please use the stdlib urlparse module, like this. Generally, I try to avoid regex unless it is absolutely necessary.

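>>> from urlparse import urlparse, urlunparse  # Python 2; in Python 3: from urllib.parse import urlparse, urlunparse
>>> parsed = urlparse("http://www.exmaple.com/boo/a.php?a=jsd")
>>> scheme, netloc, path, params, query, fragment = parsed
>>> # path.split('/')[1] keeps only the first path segment; params, query, and fragment are blanked
>>> urlunparse((scheme, netloc, path.split('/')[1], '', '', ''))
'http://www.exmaple.com/boo'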

1 Comment

georg Over a year ago

Yes, however, the path.split part requires some tune-up (check out http://www.exmaple.com/).
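
One way the tune-up mentioned in this comment might look is sketched below. It assumes Python 3, where the module lives at urllib.parse, and the helper name strip_to_directory is made up for this illustration. It drops the query and fragment, then keeps the path only up to its last "/", so a bare host or a root URL ends in a single slash.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_to_directory(url):
    """Hypothetical helper: drop query, fragment, and the last path segment."""
    parts = urlsplit(url)
    # rpartition keeps everything before the last "/"; an empty path
    # or a top-level file therefore collapses to "/".
    directory = parts.path.rpartition("/")[0] + "/"
    # Rebuild with empty query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


for example in (
    "http://www.exmaple.com/boo/a.php?a=jsd",
    "http://www.exmaple.com/abc.html",
    "http://www.exmaple.com/",
):
    print(strip_to_directory(example))
```

Because the path is handled separately from the query and fragment, slashes inside either of those do not affect where the URL is cut.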

clement g · Accepted Answer · 2013-01-08 14:53:37Z

1

I would do something like this:

>>> url = "http://www.exmaple.com/boo/a.php?a=jsd"
>>> url[:url.rfind("/")+1]
'http://www.exmaple.com/boo/'

This removes everything after the last "/". I am not sure it covers all special cases, though.

EDIT: New solution using urlparse and my simple rfind:

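import urlparse  # Python 2; in Python 3 use urllib.parse

def url_cutter(url):
    up = urlparse.urlparse(url)
    # Rebuild scheme://netloc/path, leaving out params, query, and fragment
    url2 = up[0] + "://" + up[1] + up[2]
    # 6 is the index of the second "/" in "http://", so this check assumes an
    # http:// URL; an https:// URL without a path would be cut to "https://"
    if url.rfind("/") > 6:
        url2 = url2[:url2.rfind("/") + 1]
    return url2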

Then:

In [36]: url_cutter("http://www.exmaple.com/boo/a.php?a=jsd")
Out[36]: 'http://www.exmaple.com/boo/'

In [37]: url_cutter("http://www.exmaple.com/boo/a.php?a=jsd#dvt_on")
Out[37]: 'http://www.exmaple.com/boo/'

In [38]: url_cutter("http://www.exmaple.com")
Out[38]: 'http://www.exmaple.com'

edited Jan 8, 2013 at 14:53

answered Jan 8, 2013 at 14:25

2 Comments

clement g Over a year ago

Indeed @MevinBabu, one can add a simple test like if url.rfind("/")>6 to avoid this case.

jfs Over a year ago

It fails if the url has #fragment/with/slashes
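
The failure described in this comment appears to concern the plain rfind one-liner, and it can be reproduced with a short sketch. The example assumes Python 3 and uses urllib.parse.urldefrag to remove the fragment before cutting; the function name cut_after_last_slash and the sample URL are invented for the illustration.

```python
from urllib.parse import urldefrag


def cut_after_last_slash(url):
    # Same idea as the one-liner: keep everything up to the last "/".
    return url[:url.rfind("/") + 1]


url = "http://www.exmaple.com/boo/a.php#frag/with/slashes"

# The last "/" sits inside the fragment, so the cut lands in the wrong place.
naive = cut_after_last_slash(url)

# Removing the fragment first puts the last "/" back in the path.
safer = cut_after_last_slash(urldefrag(url).url)

print(naive)  # http://www.exmaple.com/boo/a.php#frag/with/
print(safer)  # http://www.exmaple.com/boo/
```

A query string that contains slashes would need the same treatment, which this sketch does not cover.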

Ketouem · Accepted Answer · 2013-01-08 15:27:57Z

0

There might be a more optimized way to do it, but with this one you won't need an obscure import or third-party package.

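url = "http://www.google.com/abc/abc.html?q=test"
# Drop the query string; rindex raises ValueError if "?" is missing
cleaned_url = url[:url.rindex("?")]
# Split into segments and drop any segment that contains ".html"
cleaned_url = cleaned_url.split("/")
cleaned_url = [item for item in cleaned_url if ".html" not in item]
cleaned_url = "/".join(cleaned_url)  # "http://www.google.com/abc"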

1 Comment

Ketouem Over a year ago

You might want to handle the ValueError that url.rindex raises when "?" is not present in the string.
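
A string-only variant that sidesteps the missing "?" case could look like the sketch below. It relies on str.partition, which returns the whole string when the separator is absent instead of raising. The helper name strip_query_and_file is hypothetical, and the sketch assumes the URL has the simple scheme://host/path shape used in the question. It also drops the last path segment whatever its extension, rather than filtering on ".html".

```python
def strip_query_and_file(url):
    # Hypothetical helper name, used only for this illustration.
    # partition never raises: if "#" or "?" is absent, the whole string is kept.
    base = url.partition("#")[0].partition("?")[0]
    scheme, sep, rest = base.partition("://")
    if not sep:
        # No scheme found: treat the whole string as host plus path.
        scheme, rest = "", base
    host, _, path = rest.partition("/")
    # Keep only the directory part of the path (everything before the last "/").
    directory = path.rpartition("/")[0]
    return scheme + sep + host + "/" + (directory + "/" if directory else "")


print(strip_query_and_file("http://www.google.com/abc/abc.html?q=test"))
print(strip_query_and_file("http://www.exmaple.com/abc.html"))
```

Splitting off the scheme before looking for "/" keeps the "//" after the scheme from being mistaken for a path separator.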
