I'm trying to scrape links from a table that uses pagination. I can get Selenium to iterate through the pages, and I can get the links from the first page, but when I combine the two, the process stops on the last page, where there is no longer a "Next Page" button, and I get nothing back.
I'm unsure how to gracefully tell the spider to stop and write the data to CSV. I'm using a while True: loop, so this is rather puzzling to me.
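To make the control flow concrete, here is a minimal sketch of the pattern I think I need. A stub get_next_button stands in for Selenium's find_element_by_partial_link_text, which, as I understand it, raises NoSuchElementException instead of returning a falsy value when the link is missing (which would explain why my if/else never reaches its else branch):

```python
# Sketch of a pagination loop that ends cleanly: catch the exception
# raised when the "Next Page" link is gone, break out, then save the
# collected rows. `get_next_button` is a stand-in for
# driver.find_element_by_partial_link_text("Next Page").

class NoSuchElementException(Exception):
    """Stand-in for selenium.common.exceptions.NoSuchElementException."""

pages = iter(["page 1 rows", "page 2 rows", "page 3 rows"])

def get_next_button():
    # Pretend each call "clicks" through to the next page; raise when
    # no "Next Page" link exists, as Selenium's find_element_* does.
    try:
        return next(pages)
    except StopIteration:
        raise NoSuchElementException("no 'Next Page' link on this page")

collected = []
while True:
    try:
        page = get_next_button()
    except NoSuchElementException:
        break  # last page reached: stop paginating, keep the data
    collected.append(page)

# At this point `collected` holds every page's rows and can be written
# to CSV (or, in Scrapy, returned as items and exported with
# `scrapy crawl s1 -o out.csv`).
```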
My other question has to do with targeting the links I'm trying to parse using XPath. The links are held in two different tr classes: one set is under //tr[@class ="resultsY"] and the other under //tr[@class ="resultsW"]. Is there an OR expression of some sort I can use to target all of the links in one go?
One solution I found, '//tr[@class ="resultsY"] | //tr[@class ="resultsW"]', gives me an error every time.
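As far as I can tell, | is the standard XPath 1.0 union operator, so that expression itself should be legal. A quick check with lxml (standing in here for Selenium's XPath engine, against a cut-down copy of the table) seems to confirm it matches both row classes:

```python
# Check that the union expression matches rows of both classes.
from lxml import html

doc = html.fromstring("""
<table>
  <tr class="resultsW"><td></td><td><a href="linkW"></a></td><td></td></tr>
  <tr class="resultsY"><td></td><td><a href="linkY"></a></td><td></td></tr>
</table>
""")

# The union of the two row selectors, same shape as the expression above.
rows = doc.xpath('//tr[@class="resultsY"] | //tr[@class="resultsW"]')
hrefs = [row.xpath('./td[2]/a/@href')[0] for row in rows]

# An equivalent single predicate also works:
same = doc.xpath('//tr[@class="resultsW" or @class="resultsY"]')
```

So the error may be coming from somewhere else in my spider rather than from the XPath itself.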
Here's the HTML table:
<tr class="resultsW">
    <td></td>
    <td>
        <a href="fdafda"></a> <!-- a link I'm after -->
    </td>
    <td></td>
</tr>
<tr class="resultsW">
    <td></td>
    <td>
        <a href="fdafda"></a> <!-- a link I'm after -->
    </td>
    <td></td>
</tr>
And here is my Scrapy spider:
import time

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException


class ElyseAvenueItem(Item):
    link = Field()
    link2 = Field()


class ElyseAvenueSpider(BaseSpider):
    name = "s1"
    allowed_domains = ["nces.ed.gov"]
    start_urls = ['https://nces.ed.gov/collegenavigator/']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        select = Select(self.driver.find_element_by_id(
            "ctl00_cphCollegeNavBody_ucSearchMain_ucMapMain_lstState"))
        select.deselect_by_visible_text("No Preference")
        select.select_by_visible_text("Alabama")
        self.driver.find_element_by_id(
            "ctl00_cphCollegeNavBody_ucSearchMain_btnSearch").click()

        # Here is the while loop. It gets to the end of the table, finds
        # no more "Next Page" link, and gives me the middle finger:
        '''
        while True:
            el1 = self.driver.find_element_by_partial_link_text("Next Page")
            if el1:
                el1.click()
            else:
                # return(items)
                self.driver.close()
        '''

        hxs = HtmlXPathSelector(response)
        # Here I tried:
        #   titles = self.driver.find_elements_by_xpath(
        #       '//tr[@class ="resultsW"] | //tr[@class ="resultsY"]')
        # and I got an error.
        titles = self.driver.find_elements_by_xpath('//tr[@class ="resultsW"]')
        items = []
        for title in titles:
            item = ElyseAvenueItem()
            # Here I'd like to be able to target all of the hrefs... not sure how.
            link = title.find_element_by_xpath('//tr[@class ="resultsW"]/td[2]/a')
            item["link"] = link.get_attribute('href')
            items.append(item)
        return items
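One thing I've noticed about the loop above (an observation, not tested against the real site): the inner XPath '//tr[@class ="resultsW"]/td[2]/a' starts with //, which searches from the document root, so every iteration would presumably get the first row's link. Making it relative to the current row ('./td[2]/a') should scope the search per row. A sketch with lxml standing in for Selenium's per-element find_element_by_xpath:

```python
# Demonstrate absolute vs. relative XPath inside a per-row loop.
from lxml import html

doc = html.fromstring(
    '<table>'
    '<tr class="resultsW"><td></td><td><a href="first"></a></td><td></td></tr>'
    '<tr class="resultsW"><td></td><td><a href="second"></a></td><td></td></tr>'
    '</table>'
)
rows = doc.xpath('//tr[@class="resultsW"]')

# Absolute path: evaluated from the document root even when called on a
# row element, so both iterations see the same first link.
absolute = [row.xpath('//tr[@class="resultsW"]/td[2]/a/@href')[0] for row in rows]

# Relative path: scoped to the current row, one distinct link per row.
relative = [row.xpath('./td[2]/a/@href')[0] for row in rows]
```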