
I'm new to Scrapy and still learning. I'm trying to access JSON data embedded in a page's HTML and load it into a Python dict so I can work with the data later. I've tried several things and all of them failed, so I'd appreciate any help.

I found the response.css selector for the desired tag; its result looks like this in the Scrapy shell:

response.css('div.rich-snippet script').get()

'<script type="application/ld+json">{\n    some json data with newline chars \n  }\n    ]\n}</script>'

I need everything between the outer {}, so I tried a regex like this:

response.css('div.rich-snippet script').re(r'\{[^}]*\}')

This regex should pick up everything between the braces, but the JSON contains more of these symbols, and there is other content in the response before the JSON data, so it just returns an empty list. I tried other patterns, but the result is always the same: an empty list.

.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...

So I tried something else: inside the spider, I passed the response directly to json.loads and saved the results to a file from the CLI. That doesn't work either; perhaps I'm parsing the tag wrong, or it's not even possible.

    import scrapy
    import json

    class SomeSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'url'
        ]

        def parse(self, response, **kwargs):
            json_file = response.css('div.rich-snippet script').get()

            yield json.loads(json_file)

Yet again, an empty result.

Please help me understand what I'm doing wrong. Thanks.

2 Answers

1

Your CSS selector should specify that you only want the text inside the tag, which is what ::text does, so your code becomes:


    def parse(self, response, **kwargs):
        # ::text selects the text node inside <script>; .get() returns it as a string
        json_file = response.css('div.rich-snippet script::text').get()

        yield json.loads(json_file)

You might also want to have a look at: https://github.com/scrapinghub/extruct

It might be a better fit for parsing ld+json.


3 Comments

Oh, I did not know about ::text. Better than fiddling with regular expressions, imo. +1
This is perfect, thank you! If I may ask, is it possible to parse only specific parts of the JSON data this way? Or is it better to parse all of it and extract what I need from the resulting dict later in Python? Thank you.
In this case: parse all of it. The source code will be cleaner and probably even faster. If you are in a resource-constrained setting or deal with huge JSON documents, there might be a use case for partial parsing.
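A rough sketch of the "parse all of it, then pick what you need" approach from the comment above. The JSON payload and the field names (`name`, `offers`, `price`, `priceCurrency`) are assumptions for illustration, not taken from the page in the question.

```python
import json

# Hypothetical ld+json payload, as it would come back from ::text.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Widget",
  "offers": {"price": "9.99", "priceCurrency": "EUR"}
}
"""

# Parse the whole document once.
data = json.loads(raw)

# Then pull out only the fields of interest; .get() avoids a KeyError
# when a key is missing from a particular page.
offers = data.get("offers", {})
item = {
    "name": data.get("name"),
    "price": offers.get("price"),
    "currency": offers.get("priceCurrency"),
}
```

A dict like `item` is something a spider's parse method could yield directly.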
0

You could take the response as a string and use a recursive regex on it. Recursion is not supported by the standard-library re module, but it is supported by the newer third-party regex module.
That said, a possible approach could be:

import regex

# code before 

some_json_string = response.css('div.rich-snippet script').get()
match = regex.search(r'\{(?:[^{}]*|(?R))+\}', some_json_string)

if match:
    relevant_json = match.group(0)
    # process it further here

See a demo on regex101.com for the expression.
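One caveat with brace matching is that a { or } inside a JSON string value can throw off the match. As a different technique (not the regex approach above), here is a small sketch that lets the standard library's JSON decoder find the end of the object itself via `raw_decode`. The helper name `extract_first_json_object` and the sample tag are hypothetical.

```python
import json


def extract_first_json_object(text):
    """Decode the JSON value that starts at the first '{' in text.

    Returns the decoded object, or None if there is no '{' or the
    text from that point on is not valid JSON.
    """
    start = text.find("{")
    if start == -1:
        return None
    try:
        # raw_decode parses one JSON document starting at `start`
        # and ignores whatever follows it (here: the closing tag).
        obj, _end = json.JSONDecoder().raw_decode(text, start)
    except json.JSONDecodeError:
        return None
    return obj


# Made-up tag string, similar in shape to the .get() output in the question.
tag = (
    '<script type="application/ld+json">{\n'
    '  "name": "curly {braces} inside a string",\n'
    '  "offers": [{"price": 1}]\n'
    "}</script>"
)

result = extract_first_json_object(tag)
```

This assumes the first { in the string is where the JSON begins, which holds for a plain script tag like the one shown but not for arbitrary HTML.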


Edit:

It seems that ::text is supported, so better use the other answer instead.

1 Comment

Thank you, I will use ::text, but your regex will come in handy as well :] I will try both ways just to see.
