
Following the Introduction to Computer Science track at Udacity, I'm trying to write a Python script that extracts links from a page, but I got the following error:

NameError: name 'page' is not defined

Here is the code:

def get_page(page):
    try:
        import urllib
        return urllib.urlopen(url).read()
    except:
        return ''

start_link = page.find('<a href=')
start_quote = page.find('"', start_link)
end_quote = page.find('"', start_quote + 1)
url = page[start_quote + 1:end_quote]

def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:
        return (None, 0)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return (url, end_quote)

(url, end_pos) = get_next_target(page)

page = page[end_pos:]

def print_all_links(page):
    while True:
        (url, end_pos) = get_next_target(page)
        if url:
            print(url)
            page = page[:end_pos]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))
  • Well, you use page right after the definition of get_page, before you ever define it. Commented Jan 5, 2016 at 11:06
  • When I define page="content", I get zero results. Commented Jan 5, 2016 at 11:09
  • I would use Selenium, like so: browser.find_elements_by_tag_name('a') (see the sketch after these comments). Commented Jan 5, 2016 at 11:10
  • Where are you expecting page to magically come from, exactly? Or the url, for that matter? Commented Jan 5, 2016 at 11:23
  • Starting with line 8 you use the variable page, but where is it defined? Commented Jan 5, 2016 at 11:25
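
A minimal sketch of the Selenium suggestion from the comment above (assuming a working Chrome/chromedriver setup; find_elements_by_tag_name is the legacy spelling, replaced in Selenium 4 by find_elements(By.TAG_NAME, 'a')):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Open a browser, load the page, and collect the href of every <a> tag
browser = webdriver.Chrome()  # assumes chromedriver is installed and on PATH
browser.get("http://xkcd.com/")

for element in browser.find_elements(By.TAG_NAME, 'a'):
    href = element.get_attribute('href')
    if href:
        print(href)

browser.quit()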

3 Answers


page is undefined, and that is the cause of the error: the module-level lines between the two function definitions use page before it has ever been assigned.
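
A minimal sketch of the question's own code with that fixed (assuming Python 3, so urllib.request.urlopen replaces the Python 2 urllib.urlopen; the stray module-level lines are dropped, get_page takes the url it actually uses, and the loop advances with page[end_pos:]):

def get_page(url):
    # Fetch the page and return its HTML as a string
    # (Python 3; the question's urllib.urlopen is the Python 2 form)
    try:
        import urllib.request
        return urllib.request.urlopen(url).read().decode('utf-8')
    except Exception:
        return ''

def get_next_target(page):
    # Find the next '<a href=' and return (url, position of the closing quote)
    start_link = page.find('<a href=')
    if start_link == -1:
        return (None, 0)
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return (url, end_quote)

def print_all_links(page):
    # Keep extracting targets, advancing past each match with page[end_pos:]
    while True:
        (url, end_pos) = get_next_target(page)
        if url:
            print(url)
            page = page[end_pos:]
        else:
            break

print_all_links(get_page("http://xkcd.com/"))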

For web scraping like this, you can simply use BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = "http://stackoverflow.com/"

# Fetch the page and parse its HTML
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

# Print the href of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))


You can find all tags in htmlpage whose href attribute contains http. This can be achieved with BeautifulSoup's find_all method, passing attrs={'href': re.compile("http")}:

import re
from bs4 import BeautifulSoup

# htmlpage is assumed to be the page's HTML as a string
# (for example, the result of requests.get(url).text)
soup = BeautifulSoup(htmlpage, 'html.parser')

# Collect the href of every tag whose href contains "http"
links = []
for link in soup.find_all(attrs={'href': re.compile("http")}):
    links.append(link.get('href'))

print(links)


I'm a bit late here, but here's one way to get the links off a given page:

from html.parser import HTMLParser
import urllib.request


class LinkScrape(HTMLParser):
    # Print the href of every <a> tag whose link contains 'http'

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for attr in attrs:
                if attr[0] == 'href':
                    link = attr[1]
                    if link.find('http') >= 0:
                        print('- ' + link)


if __name__ == '__main__':
    url = input('Enter URL > ')
    request_object = urllib.request.Request(url)
    page_object = urllib.request.urlopen(request_object)
    link_parser = LinkScrape()
    # Feed the downloaded HTML into the parser
    link_parser.feed(page_object.read().decode('utf-8'))

