
There is a music website I regularly read, and it has a section where users post their own fictional music-related stories. There is a 91-part series (written over a length of time and uploaded part by part) that always follows the URL convention: http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.

I would like to be able to get just the formatted text from every part and put it into one HTML file.

Conveniently, there is a link to a print version, correctly formatted for my purposes. All I would have to do is write a script to download all of the parts and then dump them into a file. Not hard.

Unfortunately, the URL for the print version is as follows: www.ultimate-guitar.com/print.php?what=article&id=95932

The only way to know which article corresponds to which ID is to look at the value attribute of a certain input tag in the original article.

What I want to do is this:

Go to each page, incrementing through the part numbers.

Find the <input> tag with attribute 'name="rowid"' and get the number in its 'value=' attribute.

Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html>, <head>, and <body>) to an HTML file.

Rinse and repeat.

Is this possible? And is Python the right language? Also, what DOM/HTML/XML library should I use?

Thanks for any help.

2 Answers


With lxml and urllib2:

import lxml.html
import urllib2

# Implement the logic to download each story page, collecting the HTML strings in a sequence named pages
url = "http://www.ultimate-guitar.com/print.php?what=article&id=%s"

for page in pages:
    html = lxml.html.fromstring(page)
    # the print ID lives in the value attribute of <input name="rowid">
    ID = html.find(".//input[@name='rowid']").value
    # fetch the print version and save its body text to <ID>.html
    article = urllib2.urlopen(url % ID).read()
    article_html = lxml.html.fromstring(article)
    with open(ID + ".html", "w") as html_file:
        html_file.write(article_html.find(".//body").text_content())
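
The downloading of each story page is left out above; here is a minimal sketch of one way to build the pages sequence, assuming the part URLs follow the pattern described in the question:

import urllib2

# Assumed URL pattern for the 91 parts, taken from the question
story_url = "http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_%d.html"

# Download every part and keep each page's HTML as a string
pages = [urllib2.urlopen(story_url % part).read() for part in range(1, 92)]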

Edit: upon running this, it seems there may be some non-ASCII characters in the page. One way to get around this is to add article = article.encode("ascii", "ignore"), or to chain the encode call directly onto .read(), forcing ASCII and ignoring everything else, though this is a lazy fix.
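
A less lossy alternative is to keep those characters and encode explicitly when writing the file. A minimal sketch, reusing url, urllib2, and lxml.html from above; the output is written as UTF-8, and the literal ID is just the example from the question:

import codecs

ID = "95932"  # example ID from the question
article = urllib2.urlopen(url % ID).read()
article_html = lxml.html.fromstring(article)
# write the body text out as UTF-8 instead of dropping non-ASCII characters
with codecs.open(ID + ".html", "w", encoding="utf-8") as html_file:
    html_file.write(article_html.find(".//body").text_content())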

This assumes you just want the text content of everything inside the body tag. It will save files named storyID.html (so "95932.html") in the same directory as the Python script. Change the save semantics if you like.
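
If you would rather end up with the single combined HTML file the question asks for, a minimal variation on the loop above (same pages, url, and imports; the output filename is just an example) is to open one file and append each article's body to it:

with open("riot_band_blues.html", "w") as combined:
    for page in pages:
        html = lxml.html.fromstring(page)
        ID = html.find(".//input[@name='rowid']").value
        article = urllib2.urlopen(url % ID).read()
        article_html = lxml.html.fromstring(article)
        # encode explicitly so non-ASCII characters do not break the write
        combined.write(article_html.find(".//body").text_content().encode("utf-8"))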



You could actually do this in JavaScript/jQuery without too much trouble. JavaScript-ish pseudocode, appending to an empty document:

for (var pageNum = 1; pageNum <= 91; pageNum++) {
    $.ajax({
        url: url + pageNum,
        async: false,
        success: function() {
            var printId = $('input[name="rowid"]').val();
            $.ajax({
                url: printUrl + printId,
                async: false,
                success: function(data) {
                    $('body').append($(data).find('body').contents());
                }
            });
        }
    });
}

After the loading completes, you could save the resulting HTML to a file.

2 Comments

This would be considered a cross-domain request and would not work because of browser security restrictions.
True. It would work as a Greasemonkey script with a bit of modification.
