
I am teaching myself Python and came up with the idea of building a simple web-crawler engine. The code is below:

def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled

However, I am always getting an HTTP 403 error.

2 Comments

  • 403 means Forbidden. Without knowing what URL(s) you're trying to access, it's hard to say if this is the desired behavior. Commented Nov 28, 2017 at 13:28
  • What I am trying to achieve is to see if the code can fetch some URLs from one page and then go into each individual URL and fetch more URLs inside the earlier found list of URLs. I will probably achieve this if I have a simple webpage with some HTTP hyperlinks, which would then give me further URLs and stop there. I tried with xkcd.com/353. Commented Nov 28, 2017 at 14:02

3 Answers


An HTTP 403 error is not related to your code. It means the URL being crawled is forbidden to access. Most of the time it means the page is only available to logged-in users or to a specific user.


I actually ran your code and got a 403 with a creativecommons link. The reason is that urllib does not send the Host header by default, and you should add it manually to avoid the error (most servers check the Host header and decide which content to serve). You could also use the Requests Python package instead of the built-in urllib; it sends the Host header by default and is more Pythonic, IMO.
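
If you want to try the Requests route, a minimal sketch of fetching a page might look like this (the `fetch_page` helper name is just illustrative, not part of the original code; install the package with `pip install requests`):

    import requests

    def fetch_page(url):
        # Requests fills in standard headers (including Host) automatically.
        # raise_for_status() turns 4xx/5xx responses into an HTTPError exception.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

You could then call `fetch_page(page)` in place of `urlopen(page).read()` inside the crawl loop and catch `requests.HTTPError` instead of `urllib.error.HTTPError`.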

I added a try-except clause to catch and log errors, then continue crawling the other links. There are a lot of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                # log the failure and keep crawling the remaining links
                print('got http error while crawling', page)
    return crawled

5 Comments

  • What I am trying to achieve is to see if the code can fetch some URLs from one page and then go into each individual URL and fetch more URLs inside the earlier found list of URLs. I will probably achieve this if I have a simple webpage with some HTTP hyperlinks, which would then give me further URLs and stop there.
  • Try to find the exact URL that causes the 403 error and add it to your question. It's more likely that the URL is the problem. Try printing the URL before the urlopen call.
  • I found the URL from the first set of the list: creativecommons.org/licenses/by-nc/2.5
  • Now, because of this one, the code is stopping and not going on to another cycle, which I desperately want it to do.
  • @Sayan Updated my answer. I skipped creativecommons and it works for some links.

You might need to add request headers or other authentication. Try adding a User-Agent header; in some cases this helps you avoid reCAPTCHA.

Example:

    User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
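
A minimal sketch of sending that header with the built-in urllib might look like the following (the target URL is just the xkcd page mentioned in the question's comments):

    import urllib.request

    # The User-Agent string from the example above
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/62.0.3202.94 Safari/537.36'
    }
    req = urllib.request.Request('https://xkcd.com/353/', headers=headers)
    intpage = urllib.request.urlopen(req).read()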



As others have said, the error is not caused by the code itself, but you may want to try a couple of things:

  • Try adding exception handlers; maybe even ignore the problematic pages altogether for now, to make sure the crawler is otherwise working as expected:

    import sys
    import urllib.request
    import urllib.error

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass  # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to your request. From the urllib.request docs:

This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So something like this might help to get around some of the 403 errors:

    headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
    req = urllib.request.Request(page, headers=headers)
    intpage = urllib.request.urlopen(req).read()
    openpage = str(intpage)

