
Following the Introduction to Computer Science track at Udacity, I'm trying to write a Python script that extracts links from a page, but I got the following error:

NameError: name 'page' is not defined

Here is the code:

def get_page(page):
    try:
        import urllib
        return urllib.urlopen(url).read()
    except:
        return ''

start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return (None, 0)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return (url, end_quote)

(url, end_pos) = get_next_target(page)

page = page[end_pos:]

def print_all_links(page):
    while True:
        (url, end_pos) = get_next_target(page)
        if url:
            print(url)
            page = page[:end_pos]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))
  • Well, you use page right after the definition of get_page, before you ever define it. Commented Jan 5, 2016 at 11:06
  • When I define page="content", I get zero results. Commented Jan 5, 2016 at 11:09
  • I would use Selenium, like so: browser.find_elements_by_tag_name('a') (see the sketch after these comments). Commented Jan 5, 2016 at 11:10
  • Where are you expecting page to magically come from, exactly? Or the url, for that matter? Commented Jan 5, 2016 at 11:23
  • Starting with line 8 you use the variable page, but where is it defined? Commented Jan 5, 2016 at 11:25
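
A minimal sketch of the Selenium suggestion from the comment above (assuming a working Chrome/chromedriver setup; find_elements_by_tag_name is the legacy spelling, replaced in Selenium 4 by find_elements(By.TAG_NAME, 'a')):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a browser, load the page, and collect the href of every <a> tag
browser = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
browser.get("http://xkcd.com/")

for element in browser.find_elements(By.TAG_NAME, 'a'):
    href = element.get_attribute('href')
    if href:
        print(href)

browser.quit()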

3 Answers


page is undefined, and that is the cause of the error: the module-level lines between the two function definitions use page before it has ever been assigned.
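
A minimal sketch of the question's own code with that fixed (assuming Python 3, so urllib.request.urlopen replaces the Python 2 urllib.urlopen; the stray module-level lines are dropped, get_page takes the url it actually uses, and the loop advances with page[end_pos:]):

def get_page(url):
    # Fetch the page and return its HTML as a string
    # (Python 3; the question's urllib.urlopen is the Python 2 form)
    try:
        import urllib.request
        return urllib.request.urlopen(url).read().decode('utf-8')
    except Exception:
        return ''

def get_next_target(page):
    # Find the next '<a href=' and return (url, position of the closing quote)
    start_link = page.find('<a href=')
    if start_link == -1:
        return (None, 0)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return (url, end_quote)

def print_all_links(page):
    # Keep extracting targets, advancing past each match with page[end_pos:]
    while True:
        (url, end_pos) = get_next_target(page)
        if url:
            print(url)
            page = page[end_pos:]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))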

For web scraping like this, you can simply use BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "http://stackoverflow.com/"

# Fetch the page and parse its HTML
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Print the href of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))


You can find all tags in htmlpage whose href attribute contains http. This can be achieved with BeautifulSoup's find_all method, passing attrs={'href': re.compile("http")}:

import re
from bs4 import BeautifulSoup

# htmlpage is assumed to be the page's HTML as a string
# (for example, the result of requests.get(url).text)
soup = BeautifulSoup(htmlpage, 'html.parser')

# Collect the href of every tag whose href contains "http"
links = []
for link in soup.find_all(attrs={'href': re.compile("http")}):
    links.append(link.get('href'))

print(links)


I'm a bit late here, but here's one way to get the links off a given page:

from html.parser import HTMLParser
import urllib.request


class LinkScrape(HTMLParser):
    # Print the href of every <a> tag whose link contains 'http'

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    link = attr[1]
                    if link.find('http') >= 0:
                        print('- ' + link)


if __name__ == '__main__':
    url = input('Enter URL > ')
    request_object = urllib.request.Request(url)
    page_object = urllib.request.urlopen(request_object)
    link_parser = LinkScrape()
    # Feed the downloaded HTML into the parser
    link_parser.feed(page_object.read().decode('utf-8'))

