
I am attempting to make a little web crawler in Python. What is tripping me up right now is the recursive part and the depth of the problem. Given a URL and a maxDepth of how many sites out from there I want to crawl, I add the URL to the set of searched sites and download all the text and links from it. For each of the links that the URL contains, I want to crawl that link and collect its words and links in turn. The problem is that by the time I go to recursively call the next URL, the depth is already at maxDepth, so it stops after visiting only one more page. Hopefully I explained that decently; basically the question I am asking is: how do I do all the recursive calls and then set self._depth += 1?

def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        self._index[each] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i, maxDepth)
You should check out Scrapy — it's open source, so you can look at its design for some ideas. Commented Sep 18, 2012 at 20:03

1 Answer


The problem with your code is that you increase self._depth each time you call the function, and since it is an instance variable, it stays increased across the following calls. Say maxDepth is 3 and you have a URL A that links to pages B and C, B links to D, and C links to E. Your call hierarchy then looks like this (assuming self._depth is 0 at the beginning):

crawl(self, A, 3)          # self._depth set to 1, following links to B and C
    crawl(self, B, 3)      # self._depth set to 2, following link to D
        crawl(self, D, 3)  # self._depth set to 3, no links to follow
    crawl(self, C, 3)      # self._depth >= maxDepth, skipping link to E

In other words, self._depth does not track the depth of the current call; it tracks the accumulated number of calls to crawl made so far across the whole traversal.
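The runaway counter can be reproduced with a small self-contained sketch (the class and the site map here are hypothetical stand-ins, with a dict playing the role of link extraction, no real fetching):

```python
class BuggyCrawler:
    def __init__(self):
        self._depth = 0   # instance state: survives across recursive calls
        self.visited = []

    def crawl(self, url, links, maxDepth):
        # links is a {url: [child urls]} dict standing in for real link extraction
        self.visited.append(url)
        if self._depth < maxDepth:
            self._depth += 1
            for child in links[url]:
                if child not in self.visited:
                    self.crawl(child, links, maxDepth)

site_map = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
c = BuggyCrawler()
c.crawl("A", site_map, 3)
# E is never visited: _depth already reached 3 while descending A -> B -> D,
# so the later call for C fails the depth check and skips its links.
```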

Instead, try something like this:

def crawl(self, url, depthToGo):
    # call this method with depthToGo set to maxDepth
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        # if the word is not in the index yet, create a new set, then add the URL
        if each not in self._index:
            self._index[each] = set()
        self._index[each].add(url)
    links = crawler_util.linksFromURL(url)
    # check if we can go deeper
    if depthToGo > 0:
        for i in links:
            if i not in self._listOfCrawled:
                # decrease depthToGo for the next level of recursion
                self.crawl(i, depthToGo - 1)
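Here is a runnable sketch of the whole thing. Since the crawler_util module isn't shown in the question, a fake one with canned pages stands in for it; the point is only to show the depthToGo countdown and the set-valued index in action:

```python
class FakeCrawlerUtil:
    # hypothetical stub for the question's crawler_util: page -> (text, links)
    pages = {
        "A": ("alpha beta", ["B", "C"]),
        "B": ("beta gamma", ["D"]),
        "C": ("gamma delta", ["E"]),
        "D": ("delta", []),
        "E": ("epsilon", []),
    }

    @classmethod
    def textFromURL(cls, url):
        return cls.pages[url][0]

    @classmethod
    def linksFromURL(cls, url):
        return cls.pages[url][1]

crawler_util = FakeCrawlerUtil

class Crawler:
    def __init__(self):
        self._listOfCrawled = set()
        self._index = {}

    def crawl(self, url, depthToGo):
        self._listOfCrawled.add(url)
        for each in crawler_util.textFromURL(url).split():
            if each not in self._index:
                self._index[each] = set()
            self._index[each].add(url)
        if depthToGo > 0:
            for i in crawler_util.linksFromURL(url):
                if i not in self._listOfCrawled:
                    self.crawl(i, depthToGo - 1)

crawler = Crawler()
crawler.crawl("A", 3)
# every page is reached, and shared words map to all of their pages,
# e.g. "gamma" -> {"B", "C"}
```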

2 Comments

Thank you for clearing that up for me! I actually have another problem, not sure if you can help, but: I have realized that when I assign the URL as the value for each word in the text, and then crawl another URL containing the same word, the original URL gets replaced by the last one. Is there a way to make the value a list of the URLs?
I fixed that in my answer, by using a set instead of a list. You might also consider using a set for the URLs already crawled, since membership lookups in sets are much faster than in lists.
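As a side note, the set-valued index can also be written with collections.defaultdict, which removes the explicit membership check. This is a variant, not what the answer above uses:

```python
from collections import defaultdict

# missing words automatically get a fresh empty set as their value
index = defaultdict(set)

# hypothetical (word, url) pairs as they might come out of the crawl
for word, url in [("python", "A"), ("crawler", "A"), ("python", "B")]:
    index[word].add(url)

# index["python"] == {"A", "B"}; index["crawler"] == {"A"}
```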
