
Working on a small web spider in Python, using the lxml module. I have a segment of code that runs an XPath query on the document and places all the links from 'a href' tags into a list. What I'd like to do is check each link as it is being added to the list and, if necessary, unescape it. I understand I should use the urllib.unquote() function, but the problem I'm experiencing is that the urllib method throws an exception, which I believe is because not every link passed to the method needs unescaping. Can anyone point me in the right direction? Here's the code I have so far:

import urllib
import urllib2
from lxml.html import parse, tostring

class Crawler():

    def __init__(self, url):
        self.url = url
        self.links = []

    def crawl(self):
        doc = parse("http://" + self.url).getroot()
        doc.make_links_absolute(self.url, resolve_base_href=True)
        for tag in doc.xpath("//a"):
            old = tag.get('href')
            fixed = urllib.unquote(old)
            self.links.append(fixed)
        print(self.links)
  • You could put a try except around it Commented Oct 24, 2010 at 4:12
  • Do you have an example of a URI that raises an exception? I have tried some with and without escaping, and can't get it to fail. Commented Oct 24, 2010 at 4:16
  • Unescaping a URL that doesn't need it shouldn't raise an exception - something else is wrong here. Can you post an example of a URL that causes an exception? Commented Oct 24, 2010 at 4:20
  • use doc.xpath("//a[@href]") to exclude a elements without href attribute. Commented Oct 24, 2010 at 4:42
  • First rule of asking questions: include the stack trace you are getting. Commented Oct 24, 2010 at 12:00

3 Answers


unquote doesn't throw exceptions because of URLs that don't need escaping. You haven't shown us the exception, but I'll guess that the problem is that old isn't a string: it's probably None, because you have an <a> tag with no href attribute.

Check the value of old before you try to use it.
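A minimal sketch of that check (assuming the None value really does come from an href-less anchor; the import fallback covers Python 3, where the function moved to urllib.parse):

```python
try:
    from urllib import unquote          # Python 2, as in the question
except ImportError:
    from urllib.parse import unquote    # Python 3 equivalent

def safe_unquote(href):
    """Unquote an href, tolerating the None that tag.get('href')
    returns for <a> elements with no href attribute."""
    if href is None:
        return None
    return unquote(href)

print(safe_unquote('/a%20b'))   # '/a b'
print(safe_unquote(None))       # None
```

Alternatively, change the XPath query to `//a[@href]` as suggested in the comments, so href-less anchors are never selected in the first place.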


1 Comment

This was it. I took a second look at the stack trace, and it was referencing a 'None' object. I made the changes to the XPath query as noted in the comments above, and it's working great now. Thanks.
You could check whether the URL actually contains escaped characters first:

url.find('%') > -1

or wrap urllib.unquote in a try..except clause.
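A sketch of the percent-sign guard (note, as a comment below points out, that this check isn't strictly needed, since unquote is a no-op on strings without escapes; the import fallback is for Python 3):

```python
try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3

url = '/path/with%20space'
if '%' in url:          # only unquote when there is something to unescape
    url = unquote(url)

print(url)  # '/path/with space'
```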

2 Comments

The lack of a % does not actually cause unquote() to raise an exception.
I think '%' in url would be slightly more Pythonic.

You could do something like this, although I don't have a URL which causes an exception, so this is just a hypothesis at this point. See if this approach works.

from urllib import unquote

#get url from your parse tree.
url_unq = unquote(url or '')
if not url_unq:
    url_unq = url

See if this works. It would be great if you could give an actual example of the URL that causes the exception. What exception is it? Could you post the stack trace?

Worst case, you could always put a try/except around that block and go about your business.
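A sketch of that worst-case fallback (the exception types are assumptions: Python 3's unquote raises TypeError on None, Python 2's raises AttributeError):

```python
try:
    from urllib import unquote          # Python 2
except ImportError:
    from urllib.parse import unquote    # Python 3

def unquote_or_keep(value):
    """Return the unquoted value, or the original value unchanged
    if unquoting fails (e.g. value is None)."""
    try:
        return unquote(value)
    except (TypeError, AttributeError):
        return value

print(unquote_or_keep('a%2Fb'))  # 'a/b'
print(unquote_or_keep(None))     # None
```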

