There is music website I regularly read, and it has a section where users post their own fictional music-related stories. There is a 91 part series (Written over a length of time, uploaded part by part) that always follows the convention of: http://www.ultimate-guitar.com/columns/fiction/riot_band_blues_part_#.html.
I would like to be able to get just the formatted text from every part and put it into one html file.
Conveniently, there is a link to a print version, correctly formatted for my purposes. All I would have to do is write a script to download all of the parts and then dump them into file. Not hard.
Unfortunately, the url for a print version is as follows: www.ultimate-guitar.com/print.php?what=article&id=95932
The only way to know what article corresponds to what ID field is to look at the value attribute of a certain input tag in the original article.
What I want to do is this:
Go to each page, incrementng through the varying numbers.
Find the <input> tag with attribute 'name="rowid"' and get the number in it's 'value=' attribute.
Go to www.ultimate-guitar.com/print.php?what=article&id=<value>.
Append everything (minus <html><head> and <body> to a html file.
Rinse and repeat.
Is this possible? And is python the right language? Also, what dom/html/xml library should I use?
Thanks for any help.