
I am trying to implement a web crawler in Python. Here is what I have so far:

import urllib2

seed = raw_input('Enter a url : ')

def getAllNewLinksOnPage(page, prevLinks):
    # Fetch the page and scan the raw HTML for <a href="..."> tags.
    response = urllib2.urlopen(page)
    html = response.read()

    links, pos, allFound = [], 0, False
    while not allFound:
        aTag = html.find("<a href=", pos)
        if aTag > -1:
            href = html.find('"', aTag + 1)
            endHref = html.find('"', href + 1)
            url = html[href + 1:endHref]
            if url[:7] == "http://":
                if url[-1] == "/":      # drop a trailing slash so duplicates match
                    url = url[:-1]
                if url not in links and url not in prevLinks:
                    links.append(url)
                    print url
            closeTag = html.find("</a>", aTag)
            pos = closeTag + 1          # resume scanning after this anchor
        else:
            allFound = True
    return links

toCrawl = [seed]
crawled = []
while toCrawl:
    url = toCrawl.pop()
    crawled.append(url)
    newLinks = getAllNewLinksOnPage(url, crawled)
    toCrawl = list(set(toCrawl) | set(newLinks))

print crawled

I was wondering how I would go about implementing a depth-first search and sorting the results.

  • Have you seen Scrapy? scrapy.org Commented Nov 5, 2013 at 20:42
  • If you want to roll your own and learn, see the Udacity example. Commented Nov 5, 2013 at 20:52
  • Please fix the indentation, and include any missing code (if necessary) to make this a runnable example, so we can tell what you're missing. Commented Nov 5, 2013 at 21:10
  • Parsing HTML and finding links is a whole lot easier with an HTML parser, like the ones built into the stdlib, or a third-party library like BeautifulSoup or lxml (a minimal sketch follows after these comments). But it looks like your getAllNewLinksOnPage is not the part you're having problems with, right? Commented Nov 5, 2013 at 21:22
  • I think you should use XPath and lxml first. Commented Nov 5, 2013 at 23:06
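
One of the comments above suggests using an HTML parser instead of string scanning; here is a minimal sketch of that route using Python 2's stdlib HTMLParser module. The class name LinkCollector is illustrative, getAllLinksOnPage matches the helper name the answer below assumes, and the filtering mirrors the question's rules (keep http:// links, drop trailing slashes):

import urllib2
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect http:// links from <a href="..."> tags as the parser runs.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href and href.startswith('http://'):
                self.links.append(href.rstrip('/'))

def getAllLinksOnPage(page):
    collector = LinkCollector()
    collector.feed(urllib2.urlopen(page).read())
    return collector.links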

1 Answer


What you've implemented so far is a sort of random-order search, because you're keeping a set of links to crawl. (You're actually keeping a list, but converting it back and forth to a set repeatedly, which scrambles whatever order you had.)

To turn this into a depth-first search, the usual solution is to do it recursively. Then you don't need any external storage of links to crawl. You do still need to keep track of links crawled so far—both to avoid repeats, and because you want to sort the links at the end (which requires having something to sort), but that's it. So:

def crawl(seed):
    crawled = set()
    def crawl_recursively(link):
        if link in crawled:
            return
        crawled.add(link)   # mark this link first, so cycles terminate
        newLinks = getAllLinksOnPage(link)
        for newLink in newLinks:
            crawl_recursively(newLink)
    crawl_recursively(seed)
    return sorted(crawled)
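
One design note on the recursive version: CPython caps recursion depth (1000 stack frames by default), so a long enough chain of links will raise a RuntimeError. The explicit-stack version below has no such limit, which is one reason to prefer it for real crawls.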

If you don't want to use recursion, the alternative is to use an explicit stack of links to crawl. But you can't keep reorganizing that stack, or it's not a stack anymore. Again, a separate set of already-crawled links solves the problem of avoiding repeated fetches.

def crawl(seed):
    crawled = set()
    to_crawl = [seed]
    while to_crawl:
        link = to_crawl.pop()        # take the most recently added link (LIFO)
        if link in crawled:
            continue
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        to_crawl.extend(newLinks)    # push this page's links on top of the stack
    return sorted(crawled)

Turning the stack into a queue (just change one line to to_crawl.pop(0)) makes this a breadth-first search.
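
A side note beyond the original answer: to_crawl.pop(0) is O(n) on a list, because every remaining element shifts down one slot. For a large frontier the usual choice is collections.deque, which pops from the left in O(1). A minimal sketch under that assumption (the name crawl_breadth_first is illustrative, not from the answer):

from collections import deque

def crawl_breadth_first(seed):
    crawled = set()
    to_crawl = deque([seed])
    while to_crawl:
        link = to_crawl.popleft()    # FIFO: oldest queued link first
        if link in crawled:
            continue
        crawled.add(link)
        to_crawl.extend(getAllLinksOnPage(link))
    return sorted(crawled)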

If you're worried about to_crawl getting too big because it's full of repeats, you can strip them out as you go. Since you want to go depth-first, you want to strip out the copies that are stacked up for later, not the new ones. The simplest way to do this is probably with an OrderedSet (e.g., the recipe linked from the collections docs):

    to_crawl = OrderedSet([seed])
    # …
        new_link_set = OrderedSet(newLinks)
        to_crawl -= new_link_set             # drop any stale copies stacked below
        to_crawl |= new_link_set - crawled   # re-add them at the top, minus crawled ones
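
Put together, the loop might look like this; a sketch assuming OrderedSet is the MutableSet recipe linked from the collections docs, whose pop() returns the most recently added item, so the search stays depth-first:

def crawl(seed):
    crawled = set()
    to_crawl = OrderedSet([seed])
    while to_crawl:
        link = to_crawl.pop()                # most recently added link, i.e. LIFO
        crawled.add(link)
        new_link_set = OrderedSet(getAllLinksOnPage(link))
        to_crawl -= new_link_set             # remove stale copies queued earlier
        to_crawl |= new_link_set - crawled   # push only the uncrawled ones
    return sorted(crawled)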