How to get an HTML file using Python?

Question

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.

How do I retrieve the page? My two main concerns are which functions to use and how to filter out the useless links from the page.

Siegfried Gevatter · Accepted Answer · 2014-04-07 16:58:06Z

24

Example using urllib and lxml.html:

import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
page = html.fromstring(urllib.urlopen(url).read())

for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")

output >>
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
    ]

edited Apr 7, 2014 at 16:58

Siegfried Gevatter


answered Dec 20, 2010 at 17:21

Vince Spicer


2 Comments

sebast26 Over a year ago

In Python 3, you should import urllib.request and use the urllib.request.urlopen function. See docs.python.org/3.2/library/…

User Over a year ago

urllib is outdated in this day and age; you should be using the requests library or something else that handles modern-day issues.
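
For Python 3, the approach from the comments above might look roughly like the sketch below. It assumes only the standard library: urllib.request fetches the page, and html.parser.HTMLParser is swapped in for lxml so nothing needs to be installed. LinkCollector, fetch_html and extract_links are hypothetical names, and the sample markup is made up so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def fetch_html(url):
    """Download a page and decode it (UTF-8 is assumed if no charset is sent)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html_text):
    """Return a list of (link text, href) tuples found in html_text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Made-up sample markup, shaped like the artist list, for an offline run.
SAMPLE = (
    '<dl><DT><a href="athma.html">Aathma Liyanage</a>'
    '<DT><a href="ahamed.html">Ahamed Mohideen</a></dl>'
)
print(extract_links(SAMPLE))
```

To run it against the real page, pass the result of fetch_html(url) to extract_links instead of SAMPLE.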

Miere · Accepted Answer · 2010-12-20 15:17:50Z

7

I think eyquem's approach would be my choice too, but I prefer httplib2 over urllib. urllib2 is too low-level a library for this job.

import httplib2, re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")

li = pat.findall(body)
print li

answered Dec 20, 2010 at 15:17

Miere


user225312 · Accepted Answer · 2010-12-20 12:22:22Z

6

Use urllib2 to get the page.
Use BeautifulSoup to parse the HTML (the page) and get what you want!

answered Dec 20, 2010 at 12:22

user225312


pulsedia · Accepted Answer · 2014-02-01 21:19:43Z

6

Check this, my friend:

import urllib.request

import re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'

sock = urllib.request.urlopen(url).read().decode("utf-8")

li = pat.findall(sock)

print(li)
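
The pattern above only matches when the markup is written exactly as `<DT><a href="...">`. A slightly more tolerant variant is sketched below, assuming the artist links are the ones that directly follow a DT tag; the sample string and the name ARTIST_LINK are made up for illustration. Regular expressions remain fragile on HTML, so this is only a quick filter, not a general parser.

```python
import re

# Case-insensitive, and tolerant of whitespace between <dt> and <a>.
ARTIST_LINK = re.compile(r'<dt>\s*<a\s+href="[^"]+">(.+?)</a>', re.IGNORECASE)

# Made-up sample: two artist entries and one navigation link without <dt>.
SAMPLE = (
    '<DT><a href="athma.html">Aathma Liyanage</a>\n'
    '<dt> <A HREF="ahamed.html">Ahamed Mohideen</A>\n'
    '<a href="../index.html">Home</a>'
)

names = ARTIST_LINK.findall(SAMPLE)
print(names)
```

The navigation link is skipped because it is not preceded by a DT tag, which is the filtering the question asks about.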

answered Feb 1, 2014 at 21:19

pulsedia


eyquem · Accepted Answer · 2010-12-20 14:40:16Z

4

Or take the straightforward route:

import urllib

import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.urlopen(url)
li = pat.findall(sock.read())
sock.close()

print li

answered Dec 20, 2010 at 14:40

eyquem


Community · Accepted Answer · 2017-05-23 12:31:56Z

1

And respect robots.txt and throttle your requests :)

(Apparently urllib2 does already according to this helpful SO post).
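
If you would rather check explicitly, Python 3's standard library ships urllib.robotparser. The sketch below assumes hand-written robots.txt rules and example.com URLs so it runs offline; build_robot_rules and polite_urls are hypothetical helper names. In real use you would call set_url() and read() on the RobotFileParser to load the site's actual robots.txt.

```python
import time
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt_lines):
    """Parse robots.txt content given as a list of lines."""
    rules = RobotFileParser()
    rules.parse(robots_txt_lines)
    return rules


def polite_urls(rules, urls, user_agent="*", delay_seconds=1.0):
    """Yield only the URLs robots.txt allows, sleeping between consecutive ones."""
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, skip it
        if not first:
            time.sleep(delay_seconds)  # simple throttle between requests
        first = False
        yield url


# Made-up rules and URLs for illustration only.
SAMPLE_ROBOTS = ["User-agent: *", "Disallow: /private/"]
rules = build_robot_rules(SAMPLE_ROBOTS)
candidates = [
    "http://example.com/art/art.html",
    "http://example.com/private/x.html",
]
allowed = list(polite_urls(rules, candidates, delay_seconds=0.0))
print(allowed)
```

Each URL yielded by polite_urls can then be passed to whatever download function you use.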

edited May 23, 2017 at 12:31

CommunityBot

answered Dec 20, 2010 at 12:25

Tim Barrass


2 Comments

Zippo Over a year ago

Is it illegal to not do so? ^.^

Tim Barrass Over a year ago

No, unless I misinterpret the multiple negatives there. :)

BCJuan · Accepted Answer · 2021-02-27 12:20:41Z

1

A more concise answer for Python 3.x, using requests and bs4. The original question actually asks two things. First, how to obtain the HTML:

import requests
html = requests.get("http://www.infolanka.com/miyuru_gee/art/art.html").content

Second, how to obtain the list of artist names:

import bs4
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = bs4.BeautifulSoup(html, "html.parser")
artist_list = []
for i in soup.find_all("a"):
    if i.parent.name == "dt":
        artist_list.append(i.contents[0])
print(artist_list)

Output:

['Aathma Liyanage',
 'Abewardhana Balasuriya',
 'Aelian Thilakeratne',
 'Ahamed Mohideen',
 'Ajantha Nakandala',
 'Ajith Ambalangoda',
 'Ajith Ariayaratne',
 'Ajith Muthukumarana',
 'Ajith Paranawithana',
...]

answered Feb 27, 2021 at 12:20

BCJuan


SysMurff · Accepted Answer · 2016-11-07 09:05:33Z

Basically, there's a function call:

render_template()

With Flask, you can easily return a single page or a list of pages with it, and it automatically reads the template files from your_workspace\templates.

Example:

/root_dir
    /templates
        /index1.html
        /index2.html
    /other_dir

routes.py

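from flask import Flask, render_template

app = Flask(__name__)

# Serve the static greeting page.
@app.route('/')
def root_dir():
    return render_template('index1.html')

# Pass the URL segment to the template as the variable "user".
@app.route('/<username>')
def root_dir_with_params(username):
    return render_template('index2.html', user=username)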

index1.html - without params

<html> <body> <h1>Hello guest!</h1> <button id="getData">Get Data!</button> </body> </html>

index2.html - with params

<html> <body> {% if user %} <h1 style="color: red;">Hello {{ user }}!</h1> {% else %} <h1>Hello guest.</h1> {% endif %} <button id="getData">Get Data!</button> </body> </html>
