
I have some pretty messy URLs that I got via scraping. The problem is that they contain spaces or other special characters in the path and query string. Here are some examples:

http://www.example.com/some path/to the/file.html
http://www.example.com/some path/?file=path to/file name.png&name=name.me

So, is there an easy and robust way to escape the URLs so that I can pass them to urlopen? I tried urllib.quote, but it seems to escape the '?', '&', and '=' in the query string as well, and it seems to escape the protocol too. Currently, I am trying to use a regex to separate the protocol, path, and query string and escape them separately, but there are cases where they aren't separated properly. Any advice is appreciated.

  • If the only problem is spaces, what's wrong with url_str.replace(' ', '%20')? Commented Jun 17, 2012 at 3:10
  • Dougal, there may be other characters that need to be encoded as well. I'll edit my question soon. Commented Jun 17, 2012 at 3:14

1 Answer


By default, urllib.quote escapes everything except letters, digits, and the characters '_', '.', '-', and '/'. You can pass it a string of additional characters to leave alone as the second argument:

>>> import urllib
>>> urllib.quote('http://www.example.com/some path/?file=path to/file name.png&name=name.me',
...              '/:?&=')
'http://www.example.com/some%20path/?file=path%20to/file%20name.png&name=name.me'

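But this is pretty tricky stuff to be messing with semi-manually: any '&', '=', or '?' that is really part of a value, rather than a separator, is left unescaped too.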
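A possibly less fragile alternative to the regex approach from the question is to let the standard library split the URL and then quote the path and query separately. The following is only a sketch under some assumptions: it uses the Python 3 names (urllib.parse.urlsplit, urlunsplit, and quote) rather than the Python 2 urllib.quote shown above, the helper name escape_url is made up for illustration, and it treats '%' as safe on the guess that scraped URLs may already be partly escaped.

```python
from urllib.parse import urlsplit, urlunsplit, quote


def escape_url(url):
    """Percent-encode the path and query of a URL, keeping its structure.

    Hypothetical helper for illustration, not part of any library.
    """
    parts = urlsplit(url)
    # Keep '/' as the path separator. '%' is kept as an assumption, so that
    # sequences which are already escaped (e.g. '%20') are not encoded twice.
    path = quote(parts.path, safe="/%")
    # Keep '=' and '&' so the key=value&key=value structure survives.
    query = quote(parts.query, safe="=&/%")
    # Scheme, host, and fragment are passed through untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


print(escape_url("http://www.example.com/some path/to the/file.html"))
print(escape_url("http://www.example.com/some path/?file=path to/file name.png&name=name.me"))
```

Because the scheme and host never go through quote, the '://' cannot be mangled. The same caveat still applies, though: a literal '&', '=', or '%' inside a value is left alone, so this only helps when the scraped URLs are otherwise structurally sound.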
