
There is a music website I regularly read, and it has a section where users post their own fictional music-related stories. There is a 91-part series (written over a length of time and uploaded part by part) that always follows the URL convention: http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.

I would like to be able to get just the formatted text from every part and put it into one HTML file.

Conveniently, there is a link to a print version, correctly formatted for my purposes. All I would have to do is write a script to download all of the parts and then dump them into a file. Not hard.

Unfortunately, the URL for the print version is as follows: www.ultimate-guitar.com/print.php?what=article&id=95932

The only way to know which article corresponds to which ID is to look at the value attribute of a certain input tag in the original article.

What I want to do is this:

Go to each page, incrementing through the part numbers.

Find the <input> tag with attribute 'name="rowid"' and get the number in its 'value=' attribute.

Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html>, <head>, and <body>) to an HTML file.

Rinse and repeat.

Is this possible? And is Python the right language? Also, what DOM/HTML/XML library should I use?

Thanks for any help.

2 Answers


With lxml and urllib2:

import lxml.html
import urllib2

# Implement the logic to download each story page, collecting the HTML strings in a sequence named pages
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"

for page in pages:
    html = lxml.html.fromstring(page)
    # the print ID lives in the value attribute of <input name="rowid">
    ID = html.find(".//input[@name='rowid']").value
    # fetch the print version and save its body text to <ID>.html
    article = urllib2.urlopen(url % ID).read()
    article_html = lxml.html.fromstring(article)
    with open(ID + ".html", "w") as html_file:
        html_file.write(article_html.find(".//body").text_content())
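
The downloading of each story page is left out above; here is a minimal sketch of one way to build the pages sequence, assuming the part URLs follow the pattern described in the question:

import urllib2

# Assumed URL pattern for the 91 parts, taken from the question
story_url = "http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_%d.html"

# Download every part and keep each page's HTML as a string
pages = [urllib2.urlopen(story_url % part).read() for part in range(1, 92)]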

Edit: upon running this, it seems there may be some non-ASCII characters in the page. One way to get around this is to add article = article.encode("ascii", "ignore"), or to chain the encode call directly onto .read(), forcing ASCII and ignoring everything else, though this is a lazy fix.
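
A less lossy alternative is to keep those characters and encode explicitly when writing the file. A minimal sketch, reusing url, urllib2, and lxml.html from above; the output is written as UTF-8, and the literal ID is just the example from the question:

import codecs

ID = "95932"  # example ID from the question
article = urllib2.urlopen(url % ID).read()
article_html = lxml.html.fromstring(article)
# write the body text out as UTF-8 instead of dropping non-ASCII characters
with codecs.open(ID + ".html", "w", encoding="utf-8") as html_file:
    html_file.write(article_html.find(".//body").text_content())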

This assumes you just want the text content of everything inside the body tag. It will save files named storyID.html (so "95932.html") in the same directory as the Python script. Change the save semantics if you like.
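
If you would rather end up with the single combined HTML file the question asks for, a minimal variation on the loop above (same pages, url, and imports; the output filename is just an example) is to open one file and append each article's body to it:

with open("riot_band_blues.html", "w") as combined:
    for page in pages:
        html = lxml.html.fromstring(page)
        ID = html.find(".//input[@name='rowid']").value
        article = urllib2.urlopen(url % ID).read()
        article_html = lxml.html.fromstring(article)
        # encode explicitly so non-ASCII characters do not break the write
        combined.write(article_html.find(".//body").text_content().encode("utf-8"))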



You could actually do this in JavaScript/jQuery without too much trouble. JavaScript-ish pseudocode, appending to an empty document:

for (var pageNum = 1; pageNum <= 91; pageNum++) {
    $.ajax({
        url: url + pageNum,
        async: false,
        success: function() {
            var printId = $('input[name="rowid"]').val();
            $.ajax({
                url: printUrl + printId,
                async: false,
                success: function(data) {
                    $('body').append($(data).find('body').contents());
                }
            });
        }
    });
}

After the loading completes, you could save the resulting HTML to a file.

2 Comments

This would be considered a cross-domain request and would not work because of browser security restrictions.
True. It would work as a Greasemonkey script with a bit of modification.
