
I am trying to scrape this web page:

https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/

I have tried different ways, but every time I get a syntax error. I don't know much Python or Scrapy. Can anyone help me?

My requirements are:

  • In the header section of the page there is a background image, some description text, and 2 product-related images.

  • In the Product Range section there are a number of images. I would like to go through all of them and scrape the individual product details.

The structure is like this:

[screenshot of the page structure]

Here is my code so far:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "plumber"
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

def parse(self, response):
    for divs in response.css('div#product-variants div.viewport div.workspace div.float-box'):
        yield {
            #response.css('div#product-variants a::attr(href)').extract()
            'producturl': divs.css('a::attr(href)').extract(),
            'imageurl': divs.css('a img::attr(src)').extract(),
            'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract(),
             next_page = producturl
             next_page = response.urljoin(next_page)
             yield scrapy.Request(next_page, callback=self.parse)
        }

1 Answer


You should take the next_page yield out of your item.
In general you can iterate through the products, build a partial item, and carry it over in your request's meta parameter, like so:

def parse(self, response):
    for divs in response.css('div#product-variants div.viewport div.workspace div.float-box'):
        item = {'producturl': divs.css('a::attr(href)').extract_first(),
                'imageurl': divs.css('a img::attr(src)').extract_first(),
                'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
        next_page = response.urljoin(item['producturl'])
        yield scrapy.Request(next_page, callback=self.parse_page, meta={'item': item})

def parse_page(self, response):
    """This is individual product page"""
    item = response.meta['item']
    item['something_new'] = 'some_value'
    return item
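
For completeness, here is a minimal, self-contained sketch of the whole spider with the fix applied, assuming the same selectors as in the question. The h1::text selector in parse_page is a placeholder I made up for illustration, not the real markup of grohe.com, so adjust it to the actual detail page:

import scrapy


class PlumberSpider(scrapy.Spider):
    name = "plumber"
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        # Each float-box is one product tile in the Product Range section.
        for box in response.css('div#product-variants div.viewport div.workspace div.float-box'):
            item = {
                'producturl': box.css('a::attr(href)').extract_first(),
                'imageurl': box.css('a img::attr(src)').extract_first(),
                'description': box.css('a div.text::text').extract() + box.css('a span.nowrap::text').extract(),
            }
            if item['producturl']:
                # Follow the product link and carry the partially built item
                # over to the detail-page callback via the request's meta dict.
                yield scrapy.Request(
                    response.urljoin(item['producturl']),
                    callback=self.parse_page,
                    meta={'item': item},
                )

    def parse_page(self, response):
        """Individual product page: enrich the item carried over in meta."""
        item = response.meta['item']
        # Placeholder selector -- replace with the real markup of the product page.
        item['title'] = response.css('h1::text').extract_first()
        yield item

Run it with scrapy crawl plumber -o products.json from inside the Scrapy project and the scraped items will be written to a JSON file.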

4 Comments

To achieve this type of requirement, is an item mandatory? And can you suggest any good site to learn this kind of nested URL scraping?
I followed the above approach and an empty JSON file is created after scraping the URL. In the middle of the console output: <GET grohe.com/in/7780/bathroom/bathroom-faucets/essence> (referer: None) 2017-02-15 17:29:51 [scrapy] ERROR: Spider error processing <GET grohe.com/in/7780/bathroom/bathroom-faucets/essence> (referer: None) Traceback (most recent call last): File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback yield next(it) File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spid
@pradeep Try my edit, and could you post the full error to a pastebin of some sort if it happens again?
