I have a list of URLs I want to scrape.
The code works when I run it on each URL individually; however, when I store the URLs in a file and loop over them, it only gets through the second URL and stops at the third.
This is my code:
import requests
from lxml import html

# read the URLs, one per line
urls = open("file.txt")
url = urls.read()
main = url.split("\n")

url_number = 0
while url_number < len(main):
    page = requests.get(main[url_number])
    tree = html.fromstring(page.text)
    tournament = tree.xpath('//title/text()')
    round1 = tree.xpath('//div[@data-round]/span/text()')
    scoreup = tree.xpath('//div[contains(@class, "top_score")]/text()')
    scoredown = tree.xpath('//div[contains(@class, "bottom_score")]/text()')
    url_number = url_number + 1
    print url_number
    print "\n"

    # pair up the player names (two per match) with the top/bottom scores
    results = []
    score_number = 0
    round_number = 0
    match_number = 0
    while round_number < len(round1):
        match_number += 1
        results.append(
            [match_number,
             round1[round_number],
             scoreup[score_number],
             round1[round_number + 1],
             scoredown[score_number],
             tournament])
        round_number = round_number + 2
        score_number = score_number + 1
    print results
This code gives me results for the first two URLs, but for the third one it only prints 3 (the url_number), followed by this error:
scoredown[score_number],
IndexError: list index out of range
The error occurs on the scoredown[score_number] line. Apparently your scraping logic does not work correctly with that specific URL; you should manually check that page and adjust your logic.
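One way to keep a single malformed page from killing the whole run is to check that the scraped lists actually line up before indexing into them, and to report and skip any page where they don't. The following is only a minimal sketch built on the same requests/lxml setup and XPath expressions as the code above; the helper name scrape_page is made up for illustration, and the selectors may still need adjusting for the page that fails.

import requests
from lxml import html

# Hypothetical helper for illustration: scrape one URL and return its rows,
# or None if the page does not have the expected structure.
def scrape_page(page_url):
    page = requests.get(page_url)
    tree = html.fromstring(page.text)
    tournament = tree.xpath('//title/text()')
    round1 = tree.xpath('//div[@data-round]/span/text()')
    scoreup = tree.xpath('//div[contains(@class, "top_score")]/text()')
    scoredown = tree.xpath('//div[contains(@class, "bottom_score")]/text()')

    # Each match needs two names and one top/bottom score pair, so the lists
    # must line up; if they don't, report the mismatch instead of crashing.
    if len(round1) != 2 * len(scoreup) or len(scoreup) != len(scoredown):
        print "Skipping %s: %d names, %d top scores, %d bottom scores" % (
            page_url, len(round1), len(scoreup), len(scoredown))
        return None

    results = []
    for match_number, (up, down) in enumerate(zip(scoreup, scoredown), 1):
        results.append([match_number,
                        round1[2 * match_number - 2],  # first player
                        up,
                        round1[2 * match_number - 1],  # second player
                        down,
                        tournament])
    return results

with open("file.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    rows = scrape_page(url)
    if rows is not None:
        print rows

Filtering blank lines when reading the file also helps: splitting the file contents on "\n" leaves an empty string at the end if the file ends with a newline, and requests.get would then fail on that empty URL.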