
I am teaching myself Python and came up with the idea of building a simple web-crawler engine. The code is below:

def find_next_url(page):
    start_of_url_line = page.find('<a href')
    if start_of_url_line == -1:
        return None, 0
    else:
        start_of_url = page.find('"http', start_of_url_line)
        if start_of_url == -1:
            return None, 0
        else:
            end_of_url = page.find('"', start_of_url + 1)
            one_url = page[start_of_url + 1 : end_of_url]
            return one_url, end_of_url

def get_all_url(page):
    p = []
    while True:
        url, end_pos = find_next_url(page)
        if url:
            p.append(url)
            page = page[end_pos + 1 : ]
        else:
            break
    return p

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)
    return a

def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            import urllib.request
            intpage = urllib.request.urlopen(page).read()
            openpage = str(intpage)
            union(tocrawl, get_all_url(openpage))
            crawled.append(page)
    return crawled

However, I am always getting an HTTP 403 error.

2 Comments

  • 403 means Forbidden. Without knowing what URL(s) you're trying to access, it's hard to say if this is the desired behavior. Commented Nov 28, 2017 at 13:28
  • What I am trying to achieve is to see if the code can fetch some URLs from one page and then go into each individual URL and fetch more URLs inside the earlier found list of URLs. I will probably achieve this if I have a simple webpage with some HTTP hyperlinks, which would then give me further URLs and stop there. I tried with xkcd.com/353. Commented Nov 28, 2017 at 14:02

3 Answers


An HTTP 403 error is not related to your code. It means the URL being crawled is forbidden to access. Most of the time it means the page is only available to logged-in users or to a specific user.


I actually ran your code and got a 403 with a creativecommons link. The reason is that urllib does not send the Host header by default, and you should add it manually to avoid the error (most servers check the Host header and decide which content to serve). You could also use the Requests Python package instead of the built-in urllib; it sends the Host header by default and is more Pythonic, IMO.
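
If you want to try the Requests route, a minimal sketch of fetching a page might look like this (the `fetch_page` helper name is just illustrative, not part of the original code; install the package with `pip install requests`):

    import requests

    def fetch_page(url):
        # Requests fills in standard headers (including Host) automatically.
        # raise_for_status() turns 4xx/5xx responses into an HTTPError exception.
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        return response.text

You could then call `fetch_page(page)` in place of `urlopen(page).read()` inside the crawl loop and catch `requests.HTTPError` instead of `urllib.error.HTTPError`.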

I added a try-except clause to catch and log errors, then continue crawling the other links. There are a lot of broken links on the web.

from urllib.request import urlopen
from urllib.error import HTTPError
...
def webcrawl(seed):
    tocrawl = [seed]
    crawled = []
    while True:
        page = tocrawl.pop()
        if page not in crawled:
            try:
                intpage = urlopen(page).read()
                openpage = str(intpage)
                union(tocrawl, get_all_url(openpage))
                crawled.append(page)
            except HTTPError as ex:
                # log the failure and keep crawling the remaining links
                print('got http error while crawling', page)
    return crawled

5 Comments

  • What I am trying to achieve is to see if the code can fetch some URLs from one page and then go into each individual URL and fetch more URLs inside the earlier found list of URLs. I will probably achieve this if I have a simple webpage with some HTTP hyperlinks, which would then give me further URLs and stop there.
  • Try to find the exact URL that causes the 403 error and add it to your question. It's more likely that the URL is the problem. Try printing the URL before the urlopen call.
  • I found the URL from the first set of the list: creativecommons.org/licenses/by-nc/2.5
  • Now, because of this one, the code is stopping and not going on to another cycle, which I desperately want it to do.
  • @Sayan Updated my answer. I skipped creativecommons and it works for some links.

You might need to add request headers or other authentication. Try adding a User-Agent header; in some cases this helps you avoid reCAPTCHA.

Example:

    User-Agent:Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36
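
A minimal sketch of sending that header with the built-in urllib might look like the following (the target URL is just the xkcd page mentioned in the question's comments):

    import urllib.request

    # The User-Agent string from the example above
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/62.0.3202.94 Safari/537.36'
    }
    req = urllib.request.Request('https://xkcd.com/353/', headers=headers)
    intpage = urllib.request.urlopen(req).read()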



As others have said, the error is not caused by the code itself, but you may want to try a couple of things:

  • Try adding exception handlers; maybe even ignore the problematic pages altogether for now, to make sure the crawler is otherwise working as expected:

    import sys
    import urllib.request
    import urllib.error

    def webcrawl(seed):
        tocrawl = [seed]
        crawled = []
        while tocrawl: # replace `while True` with an actual condition,
                       # otherwise you'll be stuck in an infinite loop
                       # until you hit an exception
            page = tocrawl.pop()
            if page not in crawled:
                try:
                    intpage = urllib.request.urlopen(page).read()
                    openpage = str(intpage)
                    union(tocrawl, get_all_url(openpage))
                    crawled.append(page)
                except urllib.error.HTTPError as e:  # catch an exception
                    if e.code == 401:  # check the status code and take action
                        pass  # or anything else you want to do in case of an `Unauthorized` error
                    elif e.code == 403:
                        pass  # or anything else you want to do in case of a `Forbidden` error
                    elif e.code == 404:
                        pass  # or anything else you want to do in case of a `Not Found` error
                    # etc
                    else:
                        print('Exception:\n{}'.format(e))  # print an unexpected exception
                        sys.exit(1)  # finish the process with exit code 1 (indicates there was a problem)
        return crawled
    
  • Try adding a User-Agent header to your request. From the urllib.request docs:

This is often used to “spoof” the User-Agent header, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6).

So something like this might help to get around some of the 403 errors:

    headers = {'User-Agent':'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'}
    req = urllib.request.Request(page, headers=headers)
    intpage = urllib.request.urlopen(req).read()
    openpage = str(intpage)

