
Just started learning web scraping using the Scrapy framework. I am trying to scrape reviews of a medicine from a medical website using the code below. But when I run "scrapy runspider spiders/medreview.py -o med.csv", I get an error like "INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and med.csv does not have any data.

# Importing Scrapy library
import scrapy

# Creating a new class to implement Spider
class MedSpider(scrapy.Spider):

    # Spider name
    name = 'reviews'

    # Domain names to scrape
    allowed_domains = ['1mg.com']

    # Base URL for the medicine reviews
    myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('.OtcPage__reviews-container___hrKgt')
        ##data = response.css('.ReviewCards__review-card___3Z733')
        # Collecting user reviews
        comments = data.css('.ReviewCards__review-description___WoLdZ')
        count = 0
        # Combining the results
        for review in comments:
            yield {'comment': ''.join(review.xpath('.//text()').extract())}
            count = count + 1

Added "start_urls = myBaseUrl" based on the @stranac comment. Now I am getting some errors in the console.

2020-09-28 16:04:34 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "E:\anaconda\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "E:\anaconda\lib\site-packages\scrapy\spiders\__init__.py", line 77, in start_requests
    yield Request(url, dont_filter=True)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 69, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
  • You defined neither start_urls nor start_requests(); your spider has nothing to parse. – Commented Sep 28, 2020 at 5:57
  • It's not start_urls=myBaseUrl; it should be start_urls=[myBaseUrl]. You got it wrong @Sumithra. – Commented Sep 28, 2020 at 12:23
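For context on the traceback above: start_urls must be a list of URL strings. When it is assigned a bare string, Scrapy iterates over it character by character, so the first "URL" it tries to request is just "h" — exactly the value in "Missing scheme in request url: h". A plain-Python sketch (no Scrapy needed) shows the difference:

```python
myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

wrong = myBaseUrl    # iterating a string yields 'h', 't', 't', 'p', ...
right = [myBaseUrl]  # iterating a list yields the full URL once

print(next(iter(wrong)))  # 'h' -- the character from the error message
print(next(iter(right)))  # the complete URL
```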

1 Answer


You are doing a few things wrong. You are trying to scrape the reviews from a page where they don't exist. You can find the reviews either here or here, so you need to use either of those urls. To access the data, it is also necessary to send headers with your requests. The following is one way you can parse the data:

import scrapy

class MedSpider(scrapy.Spider):
    name = 'reviews'
    start_urls = [
        # "https://www.1mg.com/otc/becosules-z-capsule-otc63496"
        "https://www.1mg.com/otc/becosules-z-capsule-otc63496/reviews"
    ]
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

    def start_requests(self):
        # Send a browser-like User-Agent, or the site won't serve the page
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        # The class names carry hashed suffixes, so match on the stable prefix
        for review in response.css("[class^='ReviewCards__review-card']"):
            reviewer_name = review.css("[class^='ReviewCards__name']::text").get()
            reviewer_rating = review.css("[class^='Rating__ratings-container'] > span::text").get()
            print(reviewer_name, reviewer_rating)
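One more note on the original goal of "scrapy runspider ... -o med.csv": Scrapy's feed exporter only writes items that the spider *yields*, so printing them leaves med.csv empty. Conceptually, parse should yield one dict per review; a plain-Python sketch (the sample list below is a stand-in for the response.css(...) results, with made-up values):

```python
# Stand-in for the (name, rating) pairs the real selectors would extract
sample_reviews = [("Sample User", "5"), ("Another User", "4")]

def parse_reviews(reviews):
    # Yield dicts instead of printing -- `-o med.csv` writes one row each
    for name, rating in reviews:
        yield {"name": name, "rating": rating}

items = list(parse_reviews(sample_reviews))
print(items)
```

In the real spider, the loop body would use the same review.css(...) calls as above, just replacing print() with a yield of the extracted fields.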
