
I am trying to implement a web crawler in Python. Here is what I have so far:

import urllib2

seed = raw_input('Enter a url : ')

def getAllNewLinksOnPage(page, prevLinks):
    # Fetch the page and scan the raw HTML for <a href="..."> tags.
    response = urllib2.urlopen(page)
    html = response.read()

    links, pos, allFound = [], 0, False
    while not allFound:
        aTag = html.find("<a href=", pos)
        if aTag > -1:
            href = html.find('"', aTag + 1)
            endHref = html.find('"', href + 1)
            url = html[href + 1:endHref]
            if url[:7] == "http://":
                if url[-1] == "/":      # drop a trailing slash so duplicates match
                    url = url[:-1]
                if url not in links and url not in prevLinks:
                    links.append(url)
                    print url
            closeTag = html.find("</a>", aTag)
            pos = closeTag + 1          # resume scanning after this anchor
        else:
            allFound = True
    return links

toCrawl = [seed]
crawled = []
while toCrawl:
    url = toCrawl.pop()
    crawled.append(url)
    newLinks = getAllNewLinksOnPage(url, crawled)
    toCrawl = list(set(toCrawl) | set(newLinks))

print crawled

I was wondering how I would go about implementing a depth-first search and sorting the results.

  • Have you seen Scrapy? scrapy.org Commented Nov 5, 2013 at 20:42
  • If you want to roll your own and learn, see the Udacity example. Commented Nov 5, 2013 at 20:52
  • Please fix the indentation, and include any missing code (if necessary) to make this a runnable example, so we can tell what you're missing. Commented Nov 5, 2013 at 21:10
  • Parsing HTML and finding links is a whole lot easier with an HTML parser, like the ones built into the stdlib, or a third-party library like BeautifulSoup or lxml (a minimal sketch follows after these comments). But it looks like your getAllNewLinksOnPage is not the part you're having problems with, right? Commented Nov 5, 2013 at 21:22
  • I think you should use XPath and lxml first. Commented Nov 5, 2013 at 23:06
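
One of the comments above suggests using an HTML parser instead of string scanning; here is a minimal sketch of that route using Python 2's stdlib HTMLParser module. The class name LinkCollector is illustrative, getAllLinksOnPage matches the helper name the answer below assumes, and the filtering mirrors the question's rules (keep http:// links, drop trailing slashes):

import urllib2
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    # Collect http:// links from <a href="..."> tags as the parser runs.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href and href.startswith('http://'):
                self.links.append(href.rstrip('/'))

def getAllLinksOnPage(page):
    collector = LinkCollector()
    collector.feed(urllib2.urlopen(page).read())
    return collector.links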

1 Answer


What you've implemented so far is a sort of random-order search, because you're keeping a set of links to crawl. (You're actually keeping a list, but converting it back and forth to a set repeatedly, which scrambles whatever order you had.)

To turn this into a depth-first search, the usual solution is to do it recursively. Then you don't need any external storage of links to crawl. You do still need to keep track of links crawled so far—both to avoid repeats, and because you want to sort the links at the end (which requires having something to sort), but that's it. So:

def crawl(seed):
    crawled = set()
    def crawl_recursively(link):
        if link in crawled:
            return
        crawled.add(link)   # mark this link first, so cycles terminate
        newLinks = getAllLinksOnPage(link)
        for newLink in newLinks:
            crawl_recursively(newLink)
    crawl_recursively(seed)
    return sorted(crawled)
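
One design note on the recursive version: CPython caps recursion depth (1000 stack frames by default), so a long enough chain of links will raise a RuntimeError. The explicit-stack version below has no such limit, which is one reason to prefer it for real crawls.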

If you don't want to use recursion, the alternative is to use an explicit stack of links to crawl. But you can't keep reorganizing that stack, or it's not a stack anymore. Again, a separate set of already-crawled links solves the problem of avoiding repeated fetches.

def crawl(seed):
    crawled = set()
    to_crawl = [seed]
    while to_crawl:
        link = to_crawl.pop()        # take the most recently added link (LIFO)
        if link in crawled:
            continue
        crawled.add(link)
        newLinks = getAllLinksOnPage(link)
        to_crawl.extend(newLinks)    # push this page's links on top of the stack
    return sorted(crawled)

Turning the stack into a queue (just change one line to to_crawl.pop(0)) makes this a breadth-first search.
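
A side note beyond the original answer: to_crawl.pop(0) is O(n) on a list, because every remaining element shifts down one slot. For a large frontier the usual choice is collections.deque, which pops from the left in O(1). A minimal sketch under that assumption (the name crawl_breadth_first is illustrative, not from the answer):

from collections import deque

def crawl_breadth_first(seed):
    crawled = set()
    to_crawl = deque([seed])
    while to_crawl:
        link = to_crawl.popleft()    # FIFO: oldest queued link first
        if link in crawled:
            continue
        crawled.add(link)
        to_crawl.extend(getAllLinksOnPage(link))
    return sorted(crawled)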

If you're worried about to_crawl getting too big because it's full of repeats, you can strip them out as you go. Since you want to go depth-first, you want to strip out the copies that are stacked up for later, not the new ones. The simplest way to do this is probably with an OrderedSet (e.g., the recipe linked from the collections docs):

    to_crawl = OrderedSet([seed])
    # …
        new_link_set = OrderedSet(newLinks)
        to_crawl -= new_link_set             # drop any stale copies stacked below
        to_crawl |= new_link_set - crawled   # re-add them at the top, minus crawled ones
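
Put together, the loop might look like this; a sketch assuming OrderedSet is the MutableSet recipe linked from the collections docs, whose pop() returns the most recently added item, so the search stays depth-first:

def crawl(seed):
    crawled = set()
    to_crawl = OrderedSet([seed])
    while to_crawl:
        link = to_crawl.pop()                # most recently added link, i.e. LIFO
        crawled.add(link)
        new_link_set = OrderedSet(getAllLinksOnPage(link))
        to_crawl -= new_link_set             # remove stale copies queued earlier
        to_crawl |= new_link_set - crawled   # push only the uncrawled ones
    return sorted(crawled)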