I am testing whether I can scrape a website using Scrapy. I get a response from the site, but I can't access the elements or data I want. I believe my selector is right, and I don't think there is an error in the commands, although I am a beginner with Scrapy. I want to get the tags with class results-race-name. I ran it through the Scrapy shell, using the following commands:

In [1]: fetch('https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/')

2022-01-07 15:08:58 [scrapy.core.engine] INFO: Spider opened
2022-01-07 15:09:01 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://greyhoundbet.racingpost.com/robots.txt> (referer: None)
2022-01-07 15:09:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/> (referer: None)

In [2]: view(response)
Out[2]: True

In [3]: response.css('.results-race-name').extract()
Out[3]: []

Note that view(response) only shows the page up to the loading logo.

1 Answer

It's not a CSS problem. The data is created dynamically by JavaScript. You can get it from the JSON endpoint the page calls: open the devtools in your browser, click the Network tab, find the JSON request, and fetch that URL instead.
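One way to see why the original URL only returns the loading page: everything after the `#` is a URL fragment, which the browser keeps for client-side JavaScript and does not send to the server as part of the HTTP request. The short sketch below uses only the standard library to split the URL from the question. The variable names are illustrative, not part of Scrapy.

```python
from urllib.parse import urlsplit

# URL from the question: the date lives in the fragment (after '#').
url = "https://greyhoundbet.racingpost.com/#results-list/r_date=2021-01-01/"

parts = urlsplit(url)

# The server only ever sees the path and query string.
print("path:    ", parts.path)
print("query:   ", parts.query)

# The fragment is handled by JavaScript in the browser, not by the server.
print("fragment:", parts.fragment)
```

The path is just `/` and the query is empty, so the server has no way to know which date was requested. The shell session that follows fetches the JSON endpoint instead, where the date travels in the query string.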

In [1]: req = scrapy.Request('https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings')

In [2]: fetch(req)
[scrapy.core.engine] INFO: Spider opened
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings> (referer: None)

In [3]: json_data = response.json()

In [4]: for data in json_data['meetings']['tracks']['1']['races']:
   ...:     print(data['track'])
   ...:
Newcastle
Swindon
Kinsley

In [5]: for data in json_data['meetings']['tracks']['2']['races']:
   ...:     print(data['track'])
   ...:
Monmore
Crayford
Hove
Harlow
Henlow
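The two loops above hardcode the group keys '1' and '2'. If the number of groups can vary by date, a loop over all of them may be safer. This is a minimal sketch, assuming `json_data['meetings']['tracks']` is a dict whose values each hold a `races` list, as the session above suggests. The helper name `iter_track_names` and the sample dict are made up for illustration and only mirror the shape seen in the output.

```python
def iter_track_names(json_data):
    """Yield the 'track' value of every race in every track group."""
    # Assumption: 'tracks' is a dict keyed by strings such as '1' and '2'.
    tracks = json_data.get("meetings", {}).get("tracks", {})
    for group in tracks.values():
        for race in group.get("races", []):
            yield race["track"]


# Hypothetical sample mirroring the structure used in the session above.
sample = {
    "meetings": {
        "tracks": {
            "1": {"races": [{"track": "Newcastle"}, {"track": "Swindon"}]},
            "2": {"races": [{"track": "Monmore"}]},
        }
    }
}

names = list(iter_track_names(sample))
print(names)
```

In the shell you would call it as `list(iter_track_names(response.json()))`.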

EDIT:

spider.py

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    start_urls = ['https://greyhoundbet.racingpost.com/results/blocks.sd?r_date=2021-01-01&blocks=header%2Cmeetings']

    def parse(self, response):
        json_data = response.json()

        for data in json_data['meetings']['tracks']['1']['races']:
            yield {'race': data['track']}

        for data in json_data['meetings']['tracks']['2']['races']:
            yield {'race': data['track']}

Example for spider
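The spider above is fixed to a single date. To request other dates, the same URL can be built from its parts. This is a sketch, assuming the endpoint accepts any date in the same `YYYY-MM-DD` form for `r_date`. Only the 2021-01-01 URL is confirmed above, and the helper name `results_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from the answer; everything after '?' is rebuilt below.
BASE_URL = "https://greyhoundbet.racingpost.com/results/blocks.sd"


def results_url(r_date):
    """Build the JSON results URL for a date string such as '2021-01-01'."""
    # urlencode percent-encodes the comma, giving 'header%2Cmeetings'.
    query = urlencode({"r_date": r_date, "blocks": "header,meetings"})
    return f"{BASE_URL}?{query}"


print(results_url("2021-01-01"))
```

In the spider, `start_urls` could then be written as `[results_url(d) for d in dates]` for whatever list of dates you need.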

main.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == "__main__":
    spider = 'exampleSpider'
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    process = CrawlerProcess(settings)
    process.crawl(spider)
    process.start()

How to run scrapy from a script

6 Comments

Thanks, I am new to Scrapy. It helped a lot.
Can I use response.css('.results-race-name').extract() here?
Why and how did you modify the URL?
In this case you can't use the CSS selector, simply because that content is generated with JavaScript and Scrapy doesn't execute JavaScript. What I did was look in the Network tab of the browser's devtools for the JSON request the page uses for its data. That is the URL I fetched. Next time, if you want to be sure, turn off JavaScript in your browser and see whether the site still loads the information you need.
Thanks. Can you edit the answer to show how to do it in a spider.py file? I am facing some errors.