19

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.

How do I retrieve the page? My two main concerns are which functions to use and how to filter out useless links from the page.

8 Answers

24

Example using urllib and lxml.html (Python 2):

import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
page = html.fromstring(urllib.urlopen(url).read())

for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")

output >>
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
    ]
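
The href values in that output are relative to the listing page. If the next step is to fetch each artist page, they can be resolved with the standard library's urllib.parse.urljoin. A minimal Python 3 sketch, using only the sample pairs shown above; the absolute_links helper is a hypothetical name introduced for illustration:

```python
from urllib.parse import urljoin

# The listing page the links were scraped from (taken from the question).
BASE_URL = "http://www.infolanka.com/miyuru_gee/art/art.html"


def absolute_links(pairs, base_url=BASE_URL):
    """Resolve each (name, href) pair against the page it was found on."""
    return [(name, urljoin(base_url, href)) for name, href in pairs]


# Two of the pairs from the sample output above.
sample = [
    ("Aathma Liyanage", "athma.html"),
    ("Abewardhana Balasuriya", "abewardhana.html"),
]

for name, url in absolute_links(sample):
    print(name, url)
```

urljoin replaces the last path segment of the base URL, so "athma.html" ends up next to art.html in the same directory.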

2 Comments

In Python 3 you should import urllib.request and use the urllib.request.urlopen function. See docs.python.org/3.2/library/…
urllib is outdated these days; you should use the requests library or something else that handles modern-day issues.

7

I think eyquem's approach would be my choice too, but I prefer httplib2 to urllib. urllib2 is too low-level a library for this job.

import httplib2, re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")
li = pat.findall(body)
print li


6
  1. Use urllib2 to get the page.

  2. Use BeautifulSoup to parse the HTML (the page) and get what you want!
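
The two steps above can be sketched in Python 3 with only the standard library. Note the substitutions: urllib.request stands in for urllib2, and html.parser.HTMLParser stands in for BeautifulSoup, so this is not the BeautifulSoup API. The class and function names are hypothetical, and the sketch assumes each artist link is the first tag inside a <DT>, as the regex-based answers in this thread also assume.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ArtistLinkParser(HTMLParser):
    """Collect (text, href) for each <a href> that directly follows a <dt> tag."""

    def __init__(self):
        super().__init__()
        self.last_tag = None    # most recent start tag seen
        self.capturing = False  # True while inside a wanted <a>
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DT> arrives as "dt".
        href = dict(attrs).get("href")
        if tag == "a" and self.last_tag == "dt" and href is not None:
            self.capturing = True
            self.href = href
            self.text = []
        self.last_tag = tag

    def handle_data(self, data):
        if self.capturing:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.capturing:
            self.links.append(("".join(self.text).strip(), self.href))
            self.capturing = False


def extract_artist_links(html_text):
    """Step 2: parse the HTML and keep only the links inside <dt> entries."""
    parser = ArtistLinkParser()
    parser.feed(html_text)
    return parser.links


def fetch(url):
    """Step 1: download the page. The Latin-1 decoding is an assumption."""
    with urlopen(url) as response:
        return response.read().decode("latin-1")


# Offline demonstration on a made-up fragment shaped like the target page.
sample_html = (
    '<DL><DT><a href="athma.html">Aathma Liyanage</a>\n'
    '<DT><a href="ahamed.html">Ahamed Mohideen</a></DL>'
    '<p><a href="../index.html">Home</a></p>'
)
print(extract_artist_links(sample_html))
```

The navigation link in the <p> is dropped because it does not follow a <dt>, which is one way to filter out the useless links the question mentions.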


6

Check this out, my friend:

import urllib.request
import re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.request.urlopen(url).read().decode("utf-8")
li = pat.findall(sock)
print(li)
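
The decode("utf-8") call assumes the page is UTF-8, which is not confirmed anywhere in this thread; if it is not, the call raises UnicodeDecodeError. The pattern is also case-sensitive, so it would miss lowercase <dt> tags. A hedged sketch of a more tolerant variant follows; decode_page and fetch_names are hypothetical helper names, and the fallback order (declared charset, then UTF-8, then Latin-1) is an assumption rather than a rule.

```python
import re
import urllib.request

# Same pattern as above, but case-insensitive so <dt> and <DT> both match.
PAT = re.compile(r'<DT><a href="[^"]+">(.+?)</a>', re.IGNORECASE)


def decode_page(raw, declared=None):
    """Decode bytes with the declared charset if usable, else UTF-8, else Latin-1."""
    for encoding in (declared, "utf-8"):
        if encoding:
            try:
                return raw.decode(encoding)
            except (UnicodeDecodeError, LookupError):
                pass  # wrong or unknown charset, try the next candidate
    # Latin-1 never fails, but it may mis-render some characters.
    return raw.decode("latin-1")


def fetch_names(url):
    """Download the page and return the link texts matched by PAT."""
    with urllib.request.urlopen(url) as response:
        # Charset from the Content-Type header, or None if the server sent none.
        declared = response.headers.get_content_charset()
        text = decode_page(response.read(), declared)
    return PAT.findall(text)


# Offline check on a made-up fragment; no network access needed.
print(PAT.findall(decode_page(b'<dt><A HREF="x.html">Caf\xe9</A>')))
```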


4

Or, more straightforwardly:

import urllib

import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.urlopen(url)
li = pat.findall(sock.read())
sock.close()

print li


1

And respect robots.txt and throttle your requests :)

(Apparently urllib2 does already according to this helpful SO post).
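
Whether or not the fetching library does this for you, the check can also be made explicitly. Below is a minimal Python 3 sketch using the standard library's urllib.robotparser plus a fixed delay between requests. The helper names, the example.com URLs, and the sample rules are all made up for illustration; a real run would load the site's own robots.txt with set_url() and read().

```python
import time
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_lines, user_agent="*"):
    """Return the URLs that the given robots.txt lines allow for user_agent."""
    rules = RobotFileParser()
    # For a live site: rules.set_url(".../robots.txt"); rules.read()
    rules.parse(robots_lines)
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def throttled(items, delay_seconds=1.0):
    """Yield items one at a time, sleeping between consecutive items."""
    for index, item in enumerate(items):
        if index:
            time.sleep(delay_seconds)
        yield item


# Made-up rules and URLs, purely for demonstration.
sample_robots = ["User-agent: *", "Disallow: /private/"]
sample_urls = [
    "http://example.com/art/art.html",
    "http://example.com/private/x.html",
]

for url in throttled(allowed_urls(sample_urls, sample_robots), delay_seconds=0.1):
    print("would fetch", url)
```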

2 Comments

Is it illegal to not do so? ^.^
No, unless I misinterpret the multiple negatives there. :)
1

A more concise answer, adapted to Python 3.x and using requests and bs4. The original question actually contains two questions. First, how to obtain the HTML:

import requests
html = requests.get("http://www.infolanka.com/miyuru_gee/art/art.html").content

Second, how to obtain the list of artist names:

import bs4
soup = bs4.BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid the bs4 warning
artist_list = []
for i in soup.find_all("a"):
    if i.parent.name == "dt":
        artist_list.append(i.contents[0])
print(artist_list)

Output:

['Aathma Liyanage',
 'Abewardhana Balasuriya',
 'Aelian Thilakeratne',
 'Ahamed Mohideen',
 'Ajantha Nakandala',
 'Ajith Ambalangoda',
 'Ajith Ariayaratne',
 'Ajith Muthukumarana',
 'Ajith Paranawithana',
...]


0

Basically, Flask has a function for this:

render_template()

You can easily return a single page or a list of pages with it, and it automatically looks up the files in your_workspace\templates.

Example:

/root_dir
    /templates
        /index1.html
        /index2.html
    /other_dir
        /

routes.py

from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def root_dir():
    return render_template('index1.html')

@app.route('/<username>')
def root_dir_with_params(username):
    return render_template('index2.html', user=username)

index1.html - without params

<html>
<body>
    <h1>Hello guest!</h1>
    <button id="getData">Get Data!</button>
</body>
</html>

index2.html - with params

<html>
<body>
    <!-- Built-in conditional statements in Flask's template engine -->
    {% if user %}
    <h1 style="color: red;">Hello {{ user }}!</h1>
    {% else %}
    <h1>Hello guest.</h1>
    <button id="getData">Get Data!</button>
    {% endif %}
</body>
</html>
