I am attempting to write a small web crawler in Python, and the recursive depth handling is what's tripping me up. Given a url and a maxDepth saying how many sites out from it I want to follow links, I add the url to the set of searched sites and download all the text and links from that page. For every link the page contains, I then want to crawl that link and collect its words and links in turn. The problem is that by the time I make the next recursive call, self._depth has already reached maxDepth, so the crawl stops after visiting only one more page. Hopefully I explained it decently; basically the question I am asking is: how do I make all the recursive calls first, and only then set self._depth += 1?
def crawl(self, url, maxDepth):
    self._listOfCrawled.add(url)
    text = crawler_util.textFromURL(url).split()
    for each in text:
        self._index[each] = url
    links = crawler_util.linksFromURL(url)
    if self._depth < maxDepth:
        self._depth = self._depth + 1
        for i in links:
            if i not in self._listOfCrawled:
                self.crawl(i, maxDepth)
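To make the structure concrete, here is a minimal, self-contained sketch of what I *think* I'm aiming for: the depth is passed as an argument to each call rather than stored on self, so sibling calls don't see each other's increments. The FAKE_LINKS dict and the plain Crawler class are stand-ins I made up for this sketch; my real code uses crawler_util for fetching.

```python
# FAKE_LINKS stands in for crawler_util.linksFromURL: a tiny, fixed
# link graph so the sketch runs without any network access.
FAKE_LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

class Crawler:
    def __init__(self):
        self._listOfCrawled = set()

    def crawl(self, url, depth, maxDepth):
        self._listOfCrawled.add(url)
        if depth >= maxDepth:
            return  # don't follow links any deeper from this page
        for link in FAKE_LINKS.get(url, []):
            if link not in self._listOfCrawled:
                # each recursive call gets its own depth value,
                # so one branch can't exhaust the budget for another
                self.crawl(link, depth + 1, maxDepth)

c = Crawler()
c.crawl("a", 0, maxDepth=2)
print(sorted(c._listOfCrawled))  # -> ['a', 'b', 'c', 'd']
```

With maxDepth=1 the same graph only yields 'a' plus its direct links, which is the behavior I was expecting from my version.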