I'm trying to scrape all of a site's entries and available content as a way to learn Scrapy. So far, I've been able to scrape all of the blog entries on a page, follow each one, and scrape its content. I've also found the next page's link. However, I can't figure out how to proceed from there, even though I've read quite a few tutorials and looked at example code. Here is what I have so far:

import logging

from scrapy import Request
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SaltandLavender(CrawlSpider):
    logging.getLogger('scrapy').propagate = False
    name = 'saltandlavender'
    allowed_domains=['saltandlavender.com']
    start_urls=['https://www.saltandlavender.com/category/recipes/']
    rules = (
        Rule(LinkExtractor(allow='https://www.saltandlavender.com/category/recipes/'),  callback="parse", follow= True),
    )


    def parse(self,response):
        #with open('page.html', 'wb') as html_file:
        #   html_file.write(response.body)
        print "start 1"
        for href in response.css('.entry-title a'):
            print "middle 1"
            yield response.follow(href, callback=self.process_page)
        next=response.css('li.pagination-next a::text')
        if next:
            url=''.join(response.css('li.pagination-next a::attr(href)').extract())
            print url
            Request(url)



    def process_page(self,response):
        print "start 2"
        post_images=response.css('div.entry-content img::attr(src)').extract()
        content =  {
                    'cuisine':''.join(response.xpath(".//span[@class='wprm-recipe-cuisine']/descendant::text()").extract()),
                    'title': ''.join(response.css('article.format-standard h1.entry-title::text').extract()),
                    #'content': response.xpath(".//div[@class='entry-content']/descendant::text()").extract(),
                    'ingredients': ''.join(response.css('div.wprm-recipe-ingredients-container div.wprm-recipe-ingredient-group').extract()),
                    #'time':response.css('wprm-recipe-total-time-container'),
                    'servings':''.join(response.css('span.wprm-recipe-servings::text').extract()),
                    'course':''.join(response.css('span.wprm-recipe-course::text').extract()),
                    'preparation':''.join(response.css('span.wprm-recipe-servings-name::text').extract()),
                    'url':''.join(response.url),
                    'postimage':''.join(post_images[1])
                    }
        #print content
        print "end 2"

    def errorCatch(self):
        print "Script encountered an error. Check selectors for changes in the site's layout and design..."
        return

    def updateValid(self):
        return



if __name__ == "__main__":
    LOG_ENABLED = False
    process = CrawlerProcess({
        #random.choice(useragent)
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(SaltandLavender)
    process.start()

2 Answers


There are a couple of problems with your next-page request. You use next as a variable name, which shadows the built-in next() function, and you never yield the request, so Scrapy never schedules it. Check this fix:

def parse(self, response):
    for href in response.css('.entry-title a'):
        yield response.follow(href, callback=self.process_page)
    next_page = response.css('li.pagination-next a::attr(href)').get()
    if next_page:
        # no callback given, so Scrapy calls parse() on the next page by default
        yield response.follow(next_page)
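
This works because parse is a generator: Scrapy iterates over everything the method yields and schedules each Request (or collects each item). A Request that is created but never yielded, as in the original Request(url) line, is simply discarded.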

2 Comments

Thank you, this works. I'm a complete newbie, and honestly I've struggled to figure out yield in relation to Scrapy, as I hadn't seen it before this. So when you yield response.follow, will it always loop back and go through the parse function again?
Yes, by default it will call the parse method again. But you can pass a different callback, like this: yield response.follow(next_page, self.another_parse_function)
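
To make that concrete, here is a minimal sketch with an explicit callback for the listing pages; parse_listing is a hypothetical name, while process_page is the method from the question:

def parse(self, response):
    for href in response.css('.entry-title a'):
        # each recipe page goes to its own callback
        yield response.follow(href, callback=self.process_page)
    next_page = response.css('li.pagination-next a::attr(href)').get()
    if next_page:
        # explicit callback instead of the default parse()
        yield response.follow(next_page, callback=self.parse_listing)

def parse_listing(self, response):
    # hypothetical: subsequent listing pages could be handled differently here
    ...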

You need to yield the request, not just create an instance of it.

Replace:

Request(url)

with:

yield Request(url)
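
In context, a minimal sketch of the fixed block from the question (assuming from scrapy import Request at the top of the file):

def parse(self, response):
    for href in response.css('.entry-title a'):
        yield response.follow(href, callback=self.process_page)
    next_url = ''.join(response.css('li.pagination-next a::attr(href)').extract())
    if next_url:
        # yield hands the request to Scrapy's scheduler instead of discarding it
        yield Request(next_url)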

