I'm trying to scrape all of a site's entries and available content as a way to learn Scrapy. So far, I've been able to scrape all of the blog entries on a page, follow each one, and scrape its content. I've also found the next page's link. However, I can't figure out how to proceed from there, even though I've read quite a few tutorials and looked at example code. Here is what I have so far:

import logging

from scrapy import Request
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class SaltandLavender(CrawlSpider):
    logging.getLogger('scrapy').propagate = False
    name = 'saltandlavender'
    allowed_domains=['saltandlavender.com']
    start_urls=['https://www.saltandlavender.com/category/recipes/']
    rules = (
        Rule(LinkExtractor(allow='https://www.saltandlavender.com/category/recipes/'),  callback="parse", follow= True),
    )


    def parse(self,response):
        #with open('page.html', 'wb') as html_file:
        #   html_file.write(response.body)
        print "start 1"
        for href in response.css('.entry-title a'):
            print "middle 1"
            yield response.follow(href, callback=self.process_page)
        next=response.css('li.pagination-next a::text')
        if next:
            url=''.join(response.css('li.pagination-next a::attr(href)').extract())
            print url
            Request(url)



    def process_page(self,response):
        print "start 2"
        post_images=response.css('div.entry-content img::attr(src)').extract()
        content =  {
                    'cuisine':''.join(response.xpath(".//span[@class='wprm-recipe-cuisine']/descendant::text()").extract()),
                    'title': ''.join(response.css('article.format-standard h1.entry-title::text').extract()),
                    #'content': response.xpath(".//div[@class='entry-content']/descendant::text()").extract(),
                    'ingredients': ''.join(response.css('div.wprm-recipe-ingredients-container div.wprm-recipe-ingredient-group').extract()),
                    #'time':response.css('wprm-recipe-total-time-container'),
                    'servings':''.join(response.css('span.wprm-recipe-servings::text').extract()),
                    'course':''.join(response.css('span.wprm-recipe-course::text').extract()),
                    'preparation':''.join(response.css('span.wprm-recipe-servings-name::text').extract()),
                    'url':''.join(response.url),
                    'postimage':''.join(post_images[1])
                    }
        #print content
        print "end 2"

    def errorCatch(self):
        print "Script encountered an error. Check selectors for changes in the site's layout and design..."
        return

    def updateValid(self):
        return



if __name__ == "__main__":
    LOG_ENABLED = False
    process = CrawlerProcess({
        #random.choice(useragent)
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(SaltandLavender)
    process.start()

2 Answers


There are a couple of problems with your next-page request. You use next as a variable name, which shadows the built-in next() function, and you never yield the request, so Scrapy never schedules it. Check this fix:

def parse(self, response):
    for href in response.css('.entry-title a'):
        yield response.follow(href, callback=self.process_page)
    next_page = response.css('li.pagination-next a::attr(href)').get()
    if next_page:
        # no callback given, so Scrapy calls parse() on the next page by default
        yield response.follow(next_page)
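
This works because parse is a generator: Scrapy iterates over everything the method yields and schedules each Request (or collects each item). A Request that is created but never yielded, as in the original Request(url) line, is simply discarded.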

2 Comments

Thank you, this works. I'm a complete newbie, and honestly I've struggled to figure out yield in relation to Scrapy, as I hadn't seen it before this. So when you yield response.follow, will it always loop back and go through the parse function again?
Yes, by default it will call the parse method again. But you can pass a different callback, like this: yield response.follow(next_page, self.another_parse_function)
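
To make that concrete, here is a minimal sketch with an explicit callback for the listing pages; parse_listing is a hypothetical name, while process_page is the method from the question:

def parse(self, response):
    for href in response.css('.entry-title a'):
        # each recipe page goes to its own callback
        yield response.follow(href, callback=self.process_page)
    next_page = response.css('li.pagination-next a::attr(href)').get()
    if next_page:
        # explicit callback instead of the default parse()
        yield response.follow(next_page, callback=self.parse_listing)

def parse_listing(self, response):
    # hypothetical: subsequent listing pages could be handled differently here
    ...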

You need to yield the request, not just create an instance of it.

Replace:

Request(url)

with:

yield Request(url)
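
In context, a minimal sketch of the fixed block from the question (assuming from scrapy import Request at the top of the file):

def parse(self, response):
    for href in response.css('.entry-title a'):
        yield response.follow(href, callback=self.process_page)
    next_url = ''.join(response.css('li.pagination-next a::attr(href)').extract())
    if next_url:
        # yield hands the request to Scrapy's scheduler instead of discarding it
        yield Request(next_url)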

