
I am trying to scrape this web page:

https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/

I have tried different ways, but every time I get a syntax error. I don't know much Python or Scrapy. Can anyone help me?

My requirements are:

  • In the header section of the page there is a background image, some description text, and 2 product-related images.

  • In the Product Range section there are a number of images. I would like to go through all of them and scrape the individual product details.

The structure is like this:

[screenshot of the page structure]

Here is my code so far:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "plumber"
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

def parse(self, response):
    for divs in response.css('div#product-variants div.viewport div.workspace div.float-box'):
        yield {
            #response.css('div#product-variants a::attr(href)').extract()
            'producturl': divs.css('a::attr(href)').extract(),
            'imageurl': divs.css('a img::attr(src)').extract(),
            'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract(),
             next_page = producturl
             next_page = response.urljoin(next_page)
             yield scrapy.Request(next_page, callback=self.parse)
        }

1 Answer


You should take the next_page yield out of your item.
In general you can iterate through the products, build a partial item, and carry it over in your request's meta parameter, like so:

def parse(self, response):
    for divs in response.css('div#product-variants div.viewport div.workspace div.float-box'):
        item = {'producturl': divs.css('a::attr(href)').extract_first(),
                'imageurl': divs.css('a img::attr(src)').extract_first(),
                'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
        next_page = response.urljoin(item['producturl'])
        yield scrapy.Request(next_page, callback=self.parse_page, meta={'item': item})

def parse_page(self, response):
    """This is individual product page"""
    item = response.meta['item']
    item['something_new'] = 'some_value'
    return item
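
For completeness, here is a minimal, self-contained sketch of the whole spider with the fix applied, assuming the same selectors as in the question. The h1::text selector in parse_page is a placeholder I made up for illustration, not the real markup of grohe.com, so adjust it to the actual detail page:

import scrapy


class PlumberSpider(scrapy.Spider):
    name = "plumber"
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        # Each float-box is one product tile in the Product Range section.
        for box in response.css('div#product-variants div.viewport div.workspace div.float-box'):
            item = {
                'producturl': box.css('a::attr(href)').extract_first(),
                'imageurl': box.css('a img::attr(src)').extract_first(),
                'description': box.css('a div.text::text').extract() + box.css('a span.nowrap::text').extract(),
            }
            if item['producturl']:
                # Follow the product link and carry the partially built item
                # over to the detail-page callback via the request's meta dict.
                yield scrapy.Request(
                    response.urljoin(item['producturl']),
                    callback=self.parse_page,
                    meta={'item': item},
                )

    def parse_page(self, response):
        """Individual product page: enrich the item carried over in meta."""
        item = response.meta['item']
        # Placeholder selector -- replace with the real markup of the product page.
        item['title'] = response.css('h1::text').extract_first()
        yield item

Run it with scrapy crawl plumber -o products.json from inside the Scrapy project and the scraped items will be written to a JSON file.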

4 Comments

To achieve this type of requirement, is an item mandatory? And can you suggest any good site to learn this kind of nested URL scraping?
I followed the above approach and an empty JSON file is created after scraping the URL. In the middle of the console output: <GET grohe.com/in/7780/bathroom/bathroom-faucets/essence> (referer: None) 2017-02-15 17:29:51 [scrapy] ERROR: Spider error processing <GET grohe.com/in/7780/bathroom/bathroom-faucets/essence> (referer: None) Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback yield next(it) File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spid
@pradeep Try my edit, and could you post the full error to a pastebin of some sort if it happens again?
