
Just started learning web scraping using the Scrapy framework. I am trying to scrape reviews of a medicine from a medical website using the code below. But when I run "scrapy runspider spiders/medreview.py -o med.csv", I get an error like "INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)" and med.csv does not have any data.

# Importing Scrapy library
import scrapy

# Creating a new class to implement Spider
class MedSpider(scrapy.Spider):

    # Spider name
    name = 'reviews'

    # Domain names to scrape
    allowed_domains = ['1mg.com']

    # Base URL for the medicine reviews
    myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

    # Defining a Scrapy parser
    def parse(self, response):
        data = response.css('.OtcPage__reviews-container___hrKgt')
        ##data = response.css('.ReviewCards__review-card___3Z733')
        # Collecting user reviews
        comments = data.css('.ReviewCards__review-description___WoLdZ')
        count = 0
        # Combining the results
        for review in comments:
            yield {'comment': ''.join(review.xpath('.//text()').extract())}
            count = count + 1

Added "start_urls = myBaseUrl" based on the @stranac comment. Now I am getting some errors in the console.

2020-09-28 16:04:34 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "E:\anaconda\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "E:\anaconda\lib\site-packages\scrapy\spiders\__init__.py", line 77, in start_requests
    yield Request(url, dont_filter=True)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "E:\anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 69, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
  • You defined neither start_urls nor start_requests(); your spider has nothing to parse. – Commented Sep 28, 2020 at 5:57
  • It's not start_urls=myBaseUrl; it should be start_urls=[myBaseUrl]. You got it wrong @Sumithra. – Commented Sep 28, 2020 at 12:23
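For context on the traceback above: start_urls must be a list of URL strings. When it is assigned a bare string, Scrapy iterates over it character by character, so the first "URL" it tries to request is just "h" — exactly the value in "Missing scheme in request url: h". A plain-Python sketch (no Scrapy needed) shows the difference:

```python
myBaseUrl = "https://www.1mg.com/otc/becosules-z-capsule-otc63496/amp"

wrong = myBaseUrl    # iterating a string yields 'h', 't', 't', 'p', ...
right = [myBaseUrl]  # iterating a list yields the full URL once

print(next(iter(wrong)))  # 'h' -- the character from the error message
print(next(iter(right)))  # the complete URL
```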

1 Answer


You are doing a few things wrong. You are trying to scrape the reviews from a page where they don't exist. You can find the reviews either here or here, so you need to use either of those urls. To access the data, it is also necessary to send headers with your requests. The following is one way you can parse the data:

import scrapy

class MedSpider(scrapy.Spider):
    name = 'reviews'
    start_urls = [
        # "https://www.1mg.com/otc/becosules-z-capsule-otc63496"
        "https://www.1mg.com/otc/becosules-z-capsule-otc63496/reviews"
    ]
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36"}

    def start_requests(self):
        # Send a browser-like User-Agent, or the site won't serve the page
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers=self.headers)

    def parse(self, response):
        # The class names carry hashed suffixes, so match on the stable prefix
        for review in response.css("[class^='ReviewCards__review-card']"):
            reviewer_name = review.css("[class^='ReviewCards__name']::text").get()
            reviewer_rating = review.css("[class^='Rating__ratings-container'] > span::text").get()
            print(reviewer_name, reviewer_rating)
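One more note on the original goal of "scrapy runspider ... -o med.csv": Scrapy's feed exporter only writes items that the spider *yields*, so printing them leaves med.csv empty. Conceptually, parse should yield one dict per review; a plain-Python sketch (the sample list below is a stand-in for the response.css(...) results, with made-up values):

```python
# Stand-in for the (name, rating) pairs the real selectors would extract
sample_reviews = [("Sample User", "5"), ("Another User", "4")]

def parse_reviews(reviews):
    # Yield dicts instead of printing -- `-o med.csv` writes one row each
    for name, rating in reviews:
        yield {"name": name, "rating": rating}

items = list(parse_reviews(sample_reviews))
print(items)
```

In the real spider, the loop body would use the same review.css(...) calls as above, just replacing print() with a yield of the extracted fields.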
