I'm following the official Scrapy tutorial, where I'm supposed to scrape data from http://quotes.toscrape.com. The tutorial shows how to scrape the data with the following spider:

import scrapy


class QuotesSpiderCss(scrapy.Spider):
    name = "quotes_css"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        quotes = response.css('div.quote')
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags::text').extract()
            }

Running this spider and exporting the results to a JSON file returns what's expected:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
...]

I'm trying to write the same spider using XPath instead of CSS:

import scrapy


class QuotesSpiderXpath(scrapy.Spider):
    name = 'quotes_xpath'
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]

    def parse(self, response):
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'text': quote.xpath("//span[@class='text']/text()").extract_first(),
                'author': quote.xpath("//small[@class='author']/text()").extract_first(),
                'tags': quote.xpath("//div[@class='tags']/text()").extract()
            }

But this spider returns a list in which every item is the same quote:

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n        ", "\n            Tags:\n            ", " \n            \n            ", "\n            \n            ", "\n            \n            ", "\n            \n        "]},
...]

Thanks in advance!

1 Answer

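You always get the same quote because your XPath expressions are absolute, not relative. An expression that starts with // searches the whole document from the root, even when you call it on a nested selector such as quote. As a result, extract_first() returns the first match on the page every time, and extract() returns the tag text of every quote on the page. See the Scrapy documentation on working with relative XPaths.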

Prefix each XPath expression with a dot, as in the following parse method:

def parse(self, response):
    quotes = response.xpath('//div[@class="quote"]')
    for quote in quotes:
        yield {
            'text': quote.xpath(".//span[@class='text']/text()").extract_first(),
            'author': quote.xpath(".//small[@class='author']/text()").extract_first(),
            'tags': quote.xpath(".//div[@class='tags']/text()").extract()
        }