I want to extract JSON data from a web page, so I inspected its source. The data I need is stored in the following format:

<script type="application/ld+json">
{
    'data I want to extract'
}
</script>

I tried to use:

import scrapy
import json

class OpenriceSpider(scrapy.Spider):
    name = 'openrice'
    allowed_domains = ['www.openrice.com']

    def start_requests(self):
        headers = {
            'accept-encoding': 'gzip, deflate, sdch, br',
            'accept-language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'cache-control': 'max-age=0',
        }
        url = 'https://www.openrice.com/en/hongkong/r-kitchen-one-cafe-sha-tin-western-r483821'
        yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):  # response = request url ?
        items = []
        jsonresponse = json.loads(response)

But it doesn't work. How should I change it?

1 Answer

json.loads() expects a JSON string, not a Scrapy response object. You need to locate that script element in the HTML source, extract its text, and only then load it with json.loads():

script = response.xpath("//script[@type='application/ld+json']/text()").extract_first()
json_data = json.loads(script)
print(json_data)
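
One caveat with the snippet above: extract_first() returns None when no matching script is found, and a page may contain several ld+json blocks or one that is not valid JSON. A minimal sketch of a more defensive variant follows. The helper name load_ld_json is hypothetical (it is not part of Scrapy), and it assumes you pass it the list of strings returned by .extract() on the same XPath:

```python
import json


def load_ld_json(raw_scripts):
    """Parse the text of each ld+json <script> into Python objects.

    raw_scripts is assumed to be a list of strings, e.g. the result of
    response.xpath("//script[@type='application/ld+json']/text()").extract().
    Entries that are empty, None, or not valid JSON are skipped.
    """
    parsed = []
    for raw in raw_scripts or []:
        if not raw or not raw.strip():
            continue  # nothing to parse in this block
        try:
            parsed.append(json.loads(raw))
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
    return parsed


# Inside the spider's parse() method this might be used as:
#     scripts = response.xpath("//script[@type='application/ld+json']/text()").extract()
#     for data in load_ld_json(scripts):
#         print(data)
```

Skipping bad blocks silently is a design choice for the sketch; in a real spider you may prefer to log them instead.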

Here, I am using the relatively uncommon application/ld+json type attribute to locate the script, but there are other options as well. For example, you can locate the script by some text you know it contains:

//script[contains(., 'Restaurant')]
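
As an alternative to matching on raw text, the selection can also be made after parsing, by checking each object's "@type" key. This is only a sketch: the helper name filter_by_type is hypothetical, and it assumes the page's JSON-LD uses "@type" with a value such as "Restaurant", which the question does not confirm.

```python
def filter_by_type(items, wanted):
    """Return the parsed JSON-LD objects whose "@type" matches `wanted`.

    "@type" may be a single string or a list of strings, so both are handled.
    Items that are not dicts, or that have no usable "@type", are ignored.
    """
    matches = []
    for item in items:
        if not isinstance(item, dict):
            continue
        types = item.get("@type")
        if isinstance(types, str):
            types = [types]
        if isinstance(types, list) and wanted in types:
            matches.append(item)
    return matches


# Hypothetical usage with already-parsed JSON-LD objects:
#     restaurants = filter_by_type(parsed_objects, "Restaurant")
```

Unlike contains(., 'Restaurant'), this will not match a script that merely mentions the word somewhere in a description field.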