
Working on a small web spider in Python, using the lxml module. I have a segment of code that runs an XPath query on the document and places all the links from 'a href' tags into a list. What I'd like to do is check each link as it is being added to the list and, if necessary, unescape it. I understand I should use the urllib.unquote() function, but the problem I'm experiencing is that the urllib method throws an exception, which I believe is because not every link passed to the method needs unescaping. Can anyone point me in the right direction? Here's the code I have so far:

import urllib
import urllib2
from lxml.html import parse, tostring

class Crawler():

    def __init__(self, url):
        self.url = url
        self.links = []

    def crawl(self):
        doc = parse("http://" + self.url).getroot()
        doc.make_links_absolute(self.url, resolve_base_href=True)
        for tag in doc.xpath("//a"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
        print(self.links)
  • You could put a try except around it Commented Oct 24, 2010 at 4:12
  • Do you have an example of a URI that raises an exception? I have tried some with and without escaping, and can't get it to fail. Commented Oct 24, 2010 at 4:16
  • Unescaping a URL that doesn't need it shouldn't raise an exception - something else is wrong here. Can you post an example of a URL that causes an exception? Commented Oct 24, 2010 at 4:20
  • use doc.xpath("//a[@href]") to exclude a elements without href attribute. Commented Oct 24, 2010 at 4:42
  • First rule of asking questions: include the stack trace you are getting. Commented Oct 24, 2010 at 12:00

3 Answers


unquote doesn't throw exceptions because of URLs that don't need escaping. You haven't shown us the exception, but I'll guess that the problem is that old isn't a string: it's probably None, because you have an <a> tag with no href attribute.

Check the value of old before you try to use it.
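A minimal sketch of that check (assuming the None value really does come from an href-less anchor; the import fallback covers Python 3, where the function moved to urllib.parse):

```python
try:
    from urllib import unquote          # Python 2, as in the question
except ImportError:
    from urllib.parse import unquote    # Python 3 equivalent

def safe_unquote(href):
    """Unquote an href, tolerating the None that tag.get('href')
    returns for <a> elements with no href attribute."""
    if href is None:
        return None
    return unquote(href)

print(safe_unquote('/a%20b'))   # '/a b'
print(safe_unquote(None))       # None
```

Alternatively, change the XPath query to `//a[@href]` as suggested in the comments, so href-less anchors are never selected in the first place.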


1 Comment

This was it. I took a second look at the stack trace, and it was referencing a 'None' object. I made the changes to the XPath query as noted in the comments above, and it's working great now. Thanks.
You could check whether the URL actually contains escaped characters first:

url.find('%') > -1

or wrap urllib.unquote in a try..except clause.
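A sketch of the percent-sign guard (note, as a comment below points out, that this check isn't strictly needed, since unquote is a no-op on strings without escapes; the import fallback is for Python 3):

```python
try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3

url = '/path/with%20space'
if '%' in url:          # only unquote when there is something to unescape
    url = unquote(url)

print(url)  # '/path/with space'
```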

2 Comments

The lack of a % does not actually cause unquote() to raise an exception.
I think '%' in url would be slightly more Pythonic.

You could do something like this, although I don't have a URL which causes an exception, so this is just a hypothesis at this point. See if this approach works.

from urllib import unquote

#get url from your parse tree.
url_unq = unquote(url or '')
if not url_unq:
    url_unq = url

See if this works. It would be great if you could give an actual example of the URL that causes the exception. What exception is it? Could you post the stack trace?

Worst case, you could always put a try/except around that block and go about your business.
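A sketch of that worst-case fallback (the exception types are assumptions: Python 3's unquote raises TypeError on None, Python 2's raises AttributeError):

```python
try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3

def unquote_or_keep(value):
    """Return the unquoted value, or the original value unchanged
    if unquoting fails (e.g. value is None)."""
    try:
        return unquote(value)
    except (TypeError, AttributeError):
        return value

print(unquote_or_keep('a%2Fb'))  # 'a/b'
print(unquote_or_keep(None))     # None
```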

