
I am attempting to make a little web crawler in Python. What is tripping me up right now is the recursive part and the depth of the problem. Given a URL and a maxDepth of how many sites out from there I want to crawl, I add the URL to the set of searched sites and download all the text and links from it. For each of the links that the URL contains, I want to crawl that link and collect its words and links in turn. The problem is that by the time I go to recursively call the next URL, the depth is already at maxDepth, so it stops after visiting only one more page. Hopefully I explained that decently; basically the question I am asking is: how do I do all the recursive calls and then set self._depth += 1?

def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        self._index[each] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i, maxDepth)
You should check out Scrapy — it's open source, so you can look at its design for some ideas. Commented Sep 18, 2012 at 20:03

1 Answer


The problem with your code is that you increase self._depth each time you call the function, and since it is an instance variable, it stays increased across the following calls. Say maxDepth is 3 and you have a URL A that links to pages B and C, B links to D, and C links to E. Your call hierarchy then looks like this (assuming self._depth is 0 at the beginning):

crawl(self, A, 3)          # self._depth set to 1, following links to B and C
    crawl(self, B, 3)      # self._depth set to 2, following link to D
        crawl(self, D, 3)  # self._depth set to 3, no links to follow
    crawl(self, C, 3)      # self._depth >= maxDepth, skipping link to E

In other words, self._depth does not track the depth of the current call; it tracks the accumulated number of calls to crawl made so far across the whole traversal.
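The runaway counter can be reproduced with a small self-contained sketch (the class and the site map here are hypothetical stand-ins, with a dict playing the role of link extraction, no real fetching):

```python
class BuggyCrawler:
    def __init__(self):
        self._depth = 0   # instance state: survives across recursive calls
        self.visited = []

    def crawl(self, url, links, maxDepth):
        # links is a {url: [child urls]} dict standing in for real link extraction
        self.visited.append(url)
        if self._depth < maxDepth:
            self._depth += 1
            for child in links[url]:
                if child not in self.visited:
                    self.crawl(child, links, maxDepth)

site_map = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
c = BuggyCrawler()
c.crawl("A", site_map, 3)
# E is never visited: _depth already reached 3 while descending A -> B -> D,
# so the later call for C fails the depth check and skips its links.
```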

Instead, try something like this:

def crawl(self, url, depthToGo):
    # call this method with depthToGo set to maxDepth
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        # if the word is not in the index yet, create a new set, then add the URL
        if each not in self._index:
            self._index[each] = set()
        self._index[each].add(url)
    links = crawler_util.linksFromURL(url)
    # check if we can go deeper
    if depthToGo > 0:
        for i in links:
            if i not in self._listOfCrawled:
                # decrease depthToGo for the next level of recursion
                self.crawl(i, depthToGo - 1)
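Here is a runnable sketch of the whole thing. Since the crawler_util module isn't shown in the question, a fake one with canned pages stands in for it; the point is only to show the depthToGo countdown and the set-valued index in action:

```python
class FakeCrawlerUtil:
    # hypothetical stub for the question's crawler_util: page -> (text, links)
    pages = {
        "A": ("alpha beta", ["B", "C"]),
        "B": ("beta gamma", ["D"]),
        "C": ("gamma delta", ["E"]),
        "D": ("delta", []),
        "E": ("epsilon", []),
    }

    @classmethod
    def textFromURL(cls, url):
        return cls.pages[url][0]

    @classmethod
    def linksFromURL(cls, url):
        return cls.pages[url][1]

crawler_util = FakeCrawlerUtil

class Crawler:
    def __init__(self):
        self._listOfCrawled = set()
        self._index = {}

    def crawl(self, url, depthToGo):
        self._listOfCrawled.add(url)
        for each in crawler_util.textFromURL(url).split():
            if each not in self._index:
                self._index[each] = set()
            self._index[each].add(url)
        if depthToGo > 0:
            for i in crawler_util.linksFromURL(url):
                if i not in self._listOfCrawled:
                    self.crawl(i, depthToGo - 1)

crawler = Crawler()
crawler.crawl("A", 3)
# every page is reached, and shared words map to all of their pages,
# e.g. "gamma" -> {"B", "C"}
```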

2 Comments

Thank you for clearing that up for me! I actually have another problem, not sure if you can help, but: I have realized that when I assign the URL as the value for each word in the text, and then crawl another URL containing the same word, the original URL gets replaced by the last one. Is there a way to make the value a list of the URLs?
I fixed that in my answer, by using a set instead of a list. You might also consider using a set for the URLs already crawled, since membership lookups in sets are much faster than in lists.
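As a side note, the set-valued index can also be written with collections.defaultdict, which removes the explicit membership check. This is a variant, not what the answer above uses:

```python
from collections import defaultdict

# missing words automatically get a fresh empty set as their value
index = defaultdict(set)

# hypothetical (word, url) pairs as they might come out of the crawl
for word, url in [("python", "A"), ("crawler", "A"), ("python", "B")]:
    index[word].add(url)

# index["python"] == {"A", "B"}; index["crawler"] == {"A"}
```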
