
I have a website, https://www.grohe.com/in. From that site I want to get one type of bathroom faucet: https://www.grohe.com/in/25796/bathroom/bathroom-faucets/grandera/. On that page there are multiple products/related products. I want to get each product URL and scrape its data. For that I wrote the following.

My items.py file looks like this:

from scrapy.item import Item, Field

class ScrapytestprojectItem(Item):
    producturl = Field()
    imageurl = Field()
    description = Field()

The spider code is:

import scrapy
from ScrapyTestProject.items import ScrapytestprojectItem

class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])
            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})

When I ran **scrapy crawl nestedurl -o nestedurl.csv**, an empty file was created. The console output is:

2017-02-15 18:03:11 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-02-15 18:03:13 [scrapy] DEBUG: Crawled (200) <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
2017-02-15 18:03:13 [scrapy] ERROR: Spider error processing <GET https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/lib/python2.7/dist-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/pradeep/ScrapyTestProject/ScrapyTestProject/spiders/nestedurl.py", line 15, in parse
    next_page = response.urljoin(item['producturl'])
  File "/usr/lib/python2.7/dist-packages/scrapy/http/response/text.py", line 72, in urljoin
    return urljoin(get_base_url(self), url)
  File "/usr/lib/python2.7/urlparse.py", line 261, in urljoin
    urlparse(url, bscheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 143, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python2.7/urlparse.py", line 176, in urlsplit
    cached = _parse_cache.get(key, None)
TypeError: unhashable type: 'list'
2017-02-15 18:03:13 [scrapy] INFO: Closing spider (finished)
2017-02-15 18:03:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 253,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 31063,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 15, 12, 33, 13, 396542),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/TypeError': 1,
 'start_time': datetime.datetime(2017, 2, 15, 12, 33, 11, 568424)}
2017-02-15 18:03:13 [scrapy] INFO: Spider closed (finished)

3 Answers


I think the item's producturl, divs.css('a::attr(href)').extract(), returns a list, which, when used in urljoin, causes urlparse to crash, as it cannot hash a list.
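
For illustration, here is a minimal sketch of the difference (the markup below is made up, not taken from the Grohe page):

    from scrapy.selector import Selector

    sel = Selector(text='<div class="float-box"><a href="/in/product/1">x</a></div>')
    print(sel.css('a::attr(href)').extract())        # ['/in/product/1'] -- always a list
    print(sel.css('a::attr(href)').extract_first())  # '/in/product/1'  -- a string, or None if nothing matched

urljoin() expects a string, so handing it the whole list produces the TypeError shown in the traceback.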



The URL is not being generated correctly.

You should enable logging and log some messages to debug your code.

import scrapy, logging
from ScrapyTestProject.items import ScrapytestprojectItem

class QuotesSpider(scrapy.Spider):
    name = "nestedurl"
    allowed_domains = ['www.grohe.com']
    start_urls = [
        'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/',
    ]

    def parse(self, response):
        for divs in response.css('div.viewport div.workspace div.float-box'):
            item = {'producturl': divs.css('a::attr(href)').extract(),
                    'imageurl': divs.css('a img::attr(src)').extract(),
                    'description': divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
            next_page = response.urljoin(item['producturl'])

            logging.info(next_page)  # see what it prints in the console

            yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})

3 Comments

The generated URL is like '/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000'. It should be appended to the 'www.grohe.in' base URL; then it makes sense.
The logger info shows grohe.com/in/8257/bathroom/bathroom-faucets/essence/… and multiple URLs are formed the same way.
No, you can manually join the URL, e.g. "www.grohe.in" + item['producturl']
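
For reference, a sketch of what that join does, calling urlparse.urljoin directly (Response.urljoin is a thin wrapper around it, as the traceback shows; the href is the one quoted in the comment above):

    from urlparse import urljoin  # urllib.parse.urljoin on Python 3

    base = 'https://www.grohe.com/in/7780/bathroom/bathroom-faucets/essence/'
    href = '/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000'
    print(urljoin(base, href))
    # https://www.grohe.com/in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000

Because the href is root-relative (it starts with '/'), urljoin keeps the scheme and host from the base URL and swaps in the new path, so no manual concatenation with 'www.grohe.in' is needed once producturl is a single string.
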
    item = {'producturl': divs.css('a::attr(href)').extract(),  # <--- issue here
            'imageurl': divs.css('a img::attr(src)').extract(),
            'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
    next_page = response.urljoin(item['producturl'])  # <--- here item['producturl'] is a list

To fix this use .extract_first(''):

    item = {'producturl': divs.css('a::attr(href)').extract_first(''),
            'imageurl': divs.css('a img::attr(src)').extract_first(''),
            'description' : divs.css('a div.text::text').extract() + divs.css('a span.nowrap::text').extract()}
    next_page = response.urljoin(item['producturl'])  # now a single string, not a list
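
One note on the default argument: .extract_first('') returns '' when no anchor matches, and urljoin('') simply resolves back to the current page's URL, so an empty float-box would quietly re-request the listing page. A small guard (my addition, not part of the fix above) avoids that:

    next_page = response.urljoin(item['producturl'])
    if item['producturl']:  # skip boxes that have no link at all
        yield scrapy.Request(next_page, callback=self.parse, meta={'item': item})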

2 Comments

In my spider code I used .extract_first() / .extract_first(''), but I still get the same output, no change. I tested the same thing in the Scrapy shell with .extract() itself, and it seems fine.
producturl is like /in/8257/bathroom/bathroom-faucets/essence/product-details/?product=19408-G145&color=000&material=19408000, and after that we form the link as 'grohe.com/in/8257/bathroom/bathroom-faucets/essence/…'
