
I am trying to create a program to pull all the links from a webpage and put them into a list.

import urllib.request as ur

#user defined functions
def findLinks(website):
    links = []
    line = website.readline()
    while 'href=' not in line: 
        line = website.readline() 
    while '</a>' not in line :
        links.append(line)
        line = website.readline()



#connect to a URL
website = ur.urlopen("https://www.cs.ualberta.ca/")
findLinks(website)

When I run this program, it hangs for a while and then raises a TypeError saying the string does not support the buffer interface.
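
The error comes from mixing types. A minimal reproduction (the HTML bytes here are made up; the exact wording of the error message varies by Python 3 version):

```python
# urlopen() returns a response whose readline() yields bytes,
# but the search strings in the code above are str.
line = b'<a href="https://www.cs.ualberta.ca/">CS</a>\n'

raised = False
try:
    'href=' in line            # str needle against a bytes haystack
except TypeError:
    raised = True              # Python 3 refuses to mix str and bytes

found = b'href=' in line       # prefixing the literal with b makes the types match
```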

Anyone with any pointers?

  • Which version of python? Commented Jan 12, 2016 at 16:36
  • There are many tools to make this much easier; you are assuming that the HTML has line breaks, and that a link does not span one. Google "finding links Python" — that should bring you back to some useful Q&A here. Commented Jan 12, 2016 at 16:41
  • Possible duplicate of how can I get href links from html code Commented Jan 12, 2016 at 16:42

2 Answers


Python 3 will not mix bytes with strings: urlopen returns a response whose readline() yields bytes, so to make the code "work" I had to change "href=" to b"href=" and "</a>" to b"</a>".
Even then, the links themselves were not extracted. Using re, I was able to do this:

def findthem(website):
    import re

    links = []
    line = website.readline()
    while len(line) != 0:          # readline() returns b'' at end of file
        # decode the bytes line, then capture everything between href=" and "
        req = re.findall('href="(.*?)"', line.decode())
        for l in req:
            links.append(l)

        line = website.readline()

    return links
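
To see what the pattern captures, here is a quick run on a single made-up line; (.*?) matches non-greedily, so each href value is captured separately:

```python
import re

line = b'<li><a href="/about">About</a> <a href="/contact">Contact</a></li>\n'

# decode the bytes first, then capture everything between href=" and the next "
links = re.findall('href="(.*?)"', line.decode())
print(links)   # ['/about', '/contact']
```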

6 Comments

Oh nice post, I was looking for an easy way, but I don't really know any other solutions except by reading other Stack Overflow posts. Thank you.
Yeah, that's one to bookmark. People on here get really upset whenever you suggest using regex to parse HTML.
Thank you, that fixed the problem! for future reference, why was it that the other method wouldn't work?
The original code returned a list of lines containing links, not the links themselves: the script read lines until it hit one containing an href, then appended every following line that did not contain an </a>. You should also take into consideration that not every HTML page is written with indentation, newlines, etc. This is why HTML/XML parsers are recommended — they are much more reliable.
One last question... for the link <a href="example.com/tillie" class="sister" id="link3">Tillie</a>, how would I go about extracting specifically the part that says 'Tillie' before the </a>?
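
One way, sticking with the re approach from this answer, is a second pattern that captures the text between the tag's closing > and </a> (for anything non-trivial, a parser's .get_text() is safer):

```python
import re

line = '<a href="example.com/tillie" class="sister" id="link3">Tillie</a>'

# [^>]* skips the rest of the attributes, then (.*?) captures the anchor text
texts = re.findall(r'<a [^>]*>(.*?)</a>', line)
print(texts)   # ['Tillie']
```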

A better way to get all the links from a URL would be to parse the HTML using a library like BeautifulSoup.

Here's an example that grabs all links from a URL and prints them.

import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.cs.ualberta.ca/").text
soup = BeautifulSoup(html, "html.parser")

for a in soup.find_all("a"):
    link = a.get("href")
    if link:
        print(link)
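
One caveat: href values are often relative (e.g. /research). If absolute URLs are needed, urllib.parse.urljoin can resolve each link against the page URL:

```python
from urllib.parse import urljoin

base = "https://www.cs.ualberta.ca/"

# relative hrefs are resolved against the base page URL
print(urljoin(base, "/research"))              # https://www.cs.ualberta.ca/research
# absolute hrefs pass through unchanged
print(urljoin(base, "https://example.com/x"))  # https://example.com/x
```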

