How to extract links from HTML (with Python)

Question

I've downloaded the HTML of a web page, and I'm supposed to extract all of the links from it and output them. Here is my code:

f = open('html.py','r')
heb = f.readlines()
arry = []
if 'href' in heb:
    arry = arry.append(href)

    print(arry)

I'm trying to make a list of the links and output it, but honestly I'm pretty lost. Can someone point me in the right direction? I was thinking regex is probably the way to go. Thanks.

Don't use regex on HTML! Use an HTML parser like BeautifulSoup.
– kevinsa5, Commented Jun 20, 2017 at 1:39
Possible duplicate of retrieve links from web page using python and BeautifulSoup
– Teemu Risikko, Commented Jun 20, 2017 at 5:57

icktoofay · Accepted Answer · 2017-06-20 02:02:18Z

3

You can use Beautiful Soup (which you'll need to install, e.g. with pip install BeautifulSoup4):

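import bs4

# Parse the saved page; naming the parser explicitly avoids bs4's "no parser specified" warning.
with open("my-file.html") as f:
    soup = bs4.BeautifulSoup(f, "html.parser")

# soup('a') is shorthand for soup.find_all('a'); keep only the tags that have an href.
links = [link['href'] for link in soup('a') if 'href' in link.attrs]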
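
If installing a third-party package isn't an option, roughly the same thing can be done with the standard library's html.parser module. This is a minimal sketch and a different technique from the Beautiful Soup approach above; the names LinkCollector and extract_links are made up for illustration, and the sample HTML string is an assumption.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lowercased, so <A HREF="..."> is matched as well.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return the href values of all <a> tags in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Illustrative input only; in practice, read the saved page from disk.
    sample = '<p><a href="https://example.com">one</a> <a name="x">no href</a></p>'
    for link in extract_links(sample):
        print(link)
```

To run it against a saved page, read the file into a string first (for example with open("my-file.html") and f.read()) and pass that string to extract_links. Beautiful Soup is still the more forgiving choice for messy real-world HTML.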

