
I am trying to create a program to pull all the links from a webpage and put them into a list.

import urllib.request as ur

#user defined functions
def findLinks(website):
    links = []
    line = website.readline()
    while 'href=' not in line: 
        line = website.readline() 
    while '</a>' not in line :
        links.append(line)
        line = website.readline()



#connect to a URL
website = ur.urlopen("https://www.cs.ualberta.ca/")
findLinks(website)

When I run this program, it hangs for a while and then raises a TypeError saying the string does not support the buffer interface.
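
The error comes from mixing types. A minimal reproduction (the HTML bytes here are made up; the exact wording of the error message varies by Python 3 version):

```python
# urlopen() returns a response whose readline() yields bytes,
# but the search strings in the code above are str.
line = b'<a href="https://www.cs.ualberta.ca/">CS</a>\n'

raised = False
try:
    'href=' in line            # str needle against a bytes haystack
except TypeError:
    raised = True              # Python 3 refuses to mix str and bytes

found = b'href=' in line       # prefixing the literal with b makes the types match
```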

Anyone with any pointers?

  • Which version of python? Commented Jan 12, 2016 at 16:36
  • There are many tools to make this much easier; you are assuming that the HTML has line breaks, and that a link does not span one. Google "finding links Python" — that should bring you back to some useful Q&A here. Commented Jan 12, 2016 at 16:41
  • Possible duplicate of how can I get href links from html code Commented Jan 12, 2016 at 16:42

2 Answers


Python 3 will not mix bytes with strings: urlopen returns a response whose readline() yields bytes, so to make the code "work" I had to change "href=" to b"href=" and "</a>" to b"</a>".
Even then, the links themselves were not extracted. Using re, I was able to do this:

def findthem(website):
    import re

    links = []
    line = website.readline()
    while len(line) != 0:          # readline() returns b'' at end of file
        # decode the bytes line, then capture everything between href=" and "
        req = re.findall('href="(.*?)"', line.decode())
        for l in req:
            links.append(l)

        line = website.readline()

    return links
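
To see what the pattern captures, here is a quick run on a single made-up line; (.*?) matches non-greedily, so each href value is captured separately:

```python
import re

line = b'<li><a href="/about">About</a> <a href="/contact">Contact</a></li>\n'

# decode the bytes first, then capture everything between href=" and the next "
links = re.findall('href="(.*?)"', line.decode())
print(links)   # ['/about', '/contact']
```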

6 Comments

Oh nice post, I was looking for an easy way, but I don't really know any other solutions except by reading other Stack Overflow posts. Thank you.
Yeah, that's one to bookmark. People on here get really upset whenever you suggest using regex to parse HTML.
Thank you, that fixed the problem! for future reference, why was it that the other method wouldn't work?
The original code returned a list of lines containing links, not the links themselves: the script read lines until it hit one containing an href, then appended every following line that did not contain an </a>. You should also take into consideration that not every HTML page is written with indentation, newlines, etc. This is why HTML/XML parsers are recommended — they are much more reliable.
One last question... for the link <a href="example.com/tillie" class="sister" id="link3">Tillie</a>, how would I go about extracting specifically the part that says 'Tillie' before the </a>?
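
One way, sticking with the re approach from this answer, is a second pattern that captures the text between the tag's closing > and </a> (for anything non-trivial, a parser's .get_text() is safer):

```python
import re

line = '<a href="example.com/tillie" class="sister" id="link3">Tillie</a>'

# [^>]* skips the rest of the attributes, then (.*?) captures the anchor text
texts = re.findall(r'<a [^>]*>(.*?)</a>', line)
print(texts)   # ['Tillie']
```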

A better way to get all the links from a URL would be to parse the HTML using a library like BeautifulSoup.

Here's an example that grabs all links from a URL and prints them.

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.cs.ualberta.ca/").text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    link = a.get("href")
    if link:
        print(link)
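
One caveat: href values are often relative (e.g. /research). If absolute URLs are needed, urllib.parse.urljoin can resolve each link against the page URL:

```python
from urllib.parse import urljoin

base = "https://www.cs.ualberta.ca/"

# relative hrefs are resolved against the base page URL
print(urljoin(base, "/research"))              # https://www.cs.ualberta.ca/research
# absolute hrefs pass through unchanged
print(urljoin(base, "https://example.com/x"))  # https://example.com/x
```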

