I have a list of URLs I want to scrape.
The code works when I run it on each URL individually; however, when I store the URLs in a file and loop over them, it only gets through the second URL and stops at the third.
This is my code:
import requests
from lxml import html

# read the URLs, one per line
urls = open("file.txt")
url = urls.read()
main = url.split("\n")

url_number = 0
while url_number < len(main):
    page = requests.get(main[url_number])
    tree = html.fromstring(page.text)
    tournament = tree.xpath('//title/text()')
    round1 = tree.xpath('//div[@data-round]/span/text()')
    scoreup = tree.xpath('//div[contains(@class, "top_score")]/text()')
    scoredown = tree.xpath('//div[contains(@class, "bottom_score")]/text()')
    url_number = url_number + 1
    print url_number
    print "\n"

    # pair up the player names (two per match) with the top/bottom scores
    results = []
    score_number = 0
    round_number = 0
    match_number = 0
    while round_number < len(round1):
        match_number += 1
        results.append(
            [match_number,
             round1[round_number],
             scoreup[score_number],
             round1[round_number + 1],
             scoredown[score_number],
             tournament])
        round_number = round_number + 2
        score_number = score_number + 1
    print results
This code gives me results for the first two URLs, but for the third one it only prints 3 (the url_number), followed by this error:
scoredown[score_number],
IndexError: list index out of range
The error occurs on the scoredown[score_number] line. Apparently your scraping logic does not work correctly with that specific URL; you should manually check that page and adjust your logic.
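One way to keep a single malformed page from killing the whole run is to check that the scraped lists actually line up before indexing into them, and to report and skip any page where they don't. The following is only a minimal sketch built on the same requests/lxml setup and XPath expressions as the code above; the helper name scrape_page is made up for illustration, and the selectors may still need adjusting for the page that fails.

import requests
from lxml import html

# Hypothetical helper for illustration: scrape one URL and return its rows,
# or None if the page does not have the expected structure.
def scrape_page(page_url):
    page = requests.get(page_url)
    tree = html.fromstring(page.text)
    tournament = tree.xpath('//title/text()')
    round1 = tree.xpath('//div[@data-round]/span/text()')
    scoreup = tree.xpath('//div[contains(@class, "top_score")]/text()')
    scoredown = tree.xpath('//div[contains(@class, "bottom_score")]/text()')

    # Each match needs two names and one top/bottom score pair, so the lists
    # must line up; if they don't, report the mismatch instead of crashing.
    if len(round1) != 2 * len(scoreup) or len(scoreup) != len(scoredown):
        print "Skipping %s: %d names, %d top scores, %d bottom scores" % (
            page_url, len(round1), len(scoreup), len(scoredown))
        return None

    results = []
    for match_number, (up, down) in enumerate(zip(scoreup, scoredown), 1):
        results.append([match_number,
                        round1[2 * match_number - 2],  # first player
                        up,
                        round1[2 * match_number - 1],  # second player
                        down,
                        tournament])
    return results

with open("file.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

for url in urls:
    rows = scrape_page(url)
    if rows is not None:
        print rows

Filtering blank lines when reading the file also helps: splitting the file contents on "\n" leaves an empty string at the end if the file ends with a newline, and requests.get would then fail on that empty URL.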