I'm trying to scrape links from a table that uses pagination. I can get Selenium to iterate through the pages, and I can get the links from the first page, but when I combine the two, the process stops on the last page, where there is no longer a "Next Page" button, and I get nothing back.
I'm unsure how to gracefully tell the spider to stop and write the data to CSV. I'm using a while True: loop, so this is rather puzzling to me.
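To make the control flow concrete, here is a minimal sketch of the pattern I think I need. A stub get_next_button stands in for Selenium's find_element_by_partial_link_text, which, as I understand it, raises NoSuchElementException instead of returning a falsy value when the link is missing (which would explain why my if/else never reaches its else branch):

```python
# Sketch of a pagination loop that ends cleanly: catch the exception
# raised when the "Next Page" link is gone, break out, then save the
# collected rows. `get_next_button` is a stand-in for
# driver.find_element_by_partial_link_text("Next Page").

class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

pages = iter(["page 1 rows", "page 2 rows", "page 3 rows"])

def get_next_button():
    # Pretend each call "clicks" through to the next page; raise when
    # no "Next Page" link exists, as Selenium's find_element_* does.
    try:
        return next(pages)
    except StopIteration:
        raise NoSuchElementException("no 'Next Page' link on this page")

collected = []
while True:
    try:
        page = get_next_button()
    except NoSuchElementException:
        break  # last page reached: stop paginating, keep the data
    collected.append(page)

# At this point `collected` holds every page's rows and can be written
# to CSV (or, in Scrapy, returned as items and exported with
# `scrapy crawl s1 -o out.csv`).
```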
My other question has to do with targeting the links I'm trying to parse using XPath. The links are held in two different tr classes: one set is under //tr[@class ="resultsY"] and the other under //tr[@class ="resultsW"]. Is there an OR expression of some sort I can use to target all of the links in one go?
One solution I found, '//tr[@class ="resultsY"] | //tr[@class ="resultsW"]', gives me an error every time.
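As far as I can tell, | is the standard XPath 1.0 union operator, so that expression itself should be legal. A quick check with lxml (standing in here for Selenium's XPath engine, against a cut-down copy of the table) seems to confirm it matches both row classes:

```python
# Check that the union expression matches rows of both classes.
from lxml import html

doc = html.fromstring("""
<table>
  <tr class="resultsW"><td></td><td><a href="linkW"></a></td><td></td></tr>
  <tr class="resultsY"><td></td><td><a href="linkY"></a></td><td></td></tr>
</table>
""")

# The union of the two row selectors, same shape as the expression above.
rows = doc.xpath('//tr[@class="resultsY"] | //tr[@class="resultsW"]')
hrefs = [row.xpath('./td[2]/a/@href')[0] for row in rows]

# An equivalent single predicate also works:
same = doc.xpath('//tr[@class="resultsW" or @class="resultsY"]')
```

So the error may be coming from somewhere else in my spider rather than from the XPath itself.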
Here's the HTML table:
<tr class="resultsW">
    <td></td>
    <td>
        <a href="fdafda"></a> <!-- a link I'm after -->
    </td>
    <td></td>
</tr>
<tr class="resultsW">
    <td></td>
    <td>
        <a href="fdafda"></a> <!-- a link I'm after -->
    </td>
    <td></td>
</tr>
And here is my Scrapy spider:
import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException


class ElyseAvenueItem(Item):
    link = Field()
    link2 = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "s1"
    allowed_domains = ["nces.ed.gov"]
    start_urls = ['https://nces.ed.gov/collegenavigator/']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        select = Select(self.driver.find_element_by_id(
            "ctl00_cphCollegeNavBody_ucSearchMain_ucMapMain_lstState"))
        select.deselect_by_visible_text("No Preference")
        select.select_by_visible_text("Alabama")
        self.driver.find_element_by_id(
            "ctl00_cphCollegeNavBody_ucSearchMain_btnSearch").click()

        # Here is the while loop. It gets to the end of the table, finds
        # no more "Next Page" link, and gives me the middle finger:
        '''
        while True:
            el1 = self.driver.find_element_by_partial_link_text("Next Page")
            if el1:
                el1.click()
            else:
                # return(items)
                self.driver.close()
        '''

        hxs = HtmlXPathSelector(response)
        # Here I tried:
        #   titles = self.driver.find_elements_by_xpath(
        #       '//tr[@class ="resultsW"] | //tr[@class ="resultsY"]')
        # and I got an error.
        titles = self.driver.find_elements_by_xpath('//tr[@class ="resultsW"]')
        items = []
        for title in titles:
            item = ElyseAvenueItem()
            # Here I'd like to be able to target all of the hrefs... not sure how.
            link = title.find_element_by_xpath('//tr[@class ="resultsW"]/td[2]/a')
            item["link"] = link.get_attribute('href')
            items.append(item)
        return items
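One thing I've noticed about the loop above (an observation, not tested against the real site): the inner XPath '//tr[@class ="resultsW"]/td[2]/a' starts with //, which searches from the document root, so every iteration would presumably get the first row's link. Making it relative to the current row ('./td[2]/a') should scope the search per row. A sketch with lxml standing in for Selenium's per-element find_element_by_xpath:

```python
# Demonstrate absolute vs. relative XPath inside a per-row loop.
from lxml import html

doc = html.fromstring(
    '<table>'
    '<tr class="resultsW"><td></td><td><a href="first"></a></td><td></td></tr>'
    '<tr class="resultsW"><td></td><td><a href="second"></a></td><td></td></tr>'
    '</table>'
)
rows = doc.xpath('//tr[@class="resultsW"]')

# Absolute path: evaluated from the document root even when called on a
# row element, so both iterations see the same first link.
absolute = [row.xpath('//tr[@class="resultsW"]/td[2]/a/@href')[0] for row in rows]

# Relative path: scoped to the current row, one distinct link per row.
relative = [row.xpath('./td[2]/a/@href')[0] for row in rows]
```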