Regex to access JSON data from a JavaScript HTML tag with Scrapy

Question

I'm new to Scrapy and still learning. I'm trying to access JSON data embedded in a page's HTML and load it into a Python dict so I can work with the data later. I tried several things and all of them failed, so I would appreciate any help.

I found a response.css selector for the desired tag; the result looks like this in the Scrapy shell:

response.css('div.rich-snippet script').get()

'<script type="application/ld+json">{\n    some json data with newline chars \n  }\n    ]\n}</script>'

I need everything between the outer {}, so I tried a regex like this:

response.css('div.rich-snippet script').re(r'\{[^}]*\}')

This regex should pick everything between the braces, but there are more of these symbols inside the JSON, and there are other things in the response before the JSON data, so it just returns an empty list. I tried other patterns but always got the same result, an empty list:

.re(r'<script>\{[^}]*\}</script>')
.re(r'<script>(.|\n)*?<\/script>')
...

So I tried something else: inside the spider I passed the response directly to json.loads and saved the results to a file from the CLI. That doesn't work either. Perhaps I'm parsing the tag wrong, or it's not even possible.

    import scrapy
    import json

    class SomeSpider(scrapy.Spider):
        name = 'test'
        start_urls = [
            'url'
        ]

        def parse(self, response, **kwargs):
            json_file = response.css('div.rich-snippet script').get()

            yield json.loads(json_file)

Yet again, an empty result.

Please help me understand, thanks.

FlorianLudwig · Accepted Answer · 2021-10-10 14:05:54Z

1

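Your CSS selector should specify that you only want the text inside the tag, which is what ::text does. You also need .get() so that json.loads receives a string rather than a selector list. Your code becomes:


    def parse(self, response, **kwargs):
        # ::text selects only the tag's contents, without the <script> wrapper
        json_file = response.css('div.rich-snippet script::text').get()

        yield json.loads(json_file)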

You might also want to have a look at: https://github.com/scrapinghub/extruct

It might be a better fit for parsing ld+json.

answered Oct 10, 2021 at 14:05

FlorianLudwig


3 Comments

Jan Over a year ago

Oh, I did not know about ::text. Better than fiddling with regular expressions, imo. +1

zoommer Over a year ago

This is perfect, thank you! If I may ask, is it possible to parse only specific JSON data this way? Or is it better to parse it all and then extract what I need from the resulting dict in Python? Thank you.

FlorianLudwig Over a year ago

In this case: parse all of it. The source code will be cleaner and probably even faster. If you are in a resource-constrained setting or deal with huge JSON documents, there might be a use case for partial parsing.
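A minimal sketch of the "parse everything, then pick what you need" approach, using only the standard library. The payload and its field names are assumptions for illustration, not data from the page in the question.

```python
import json

# Hypothetical ld+json payload; the keys are assumed for illustration.
raw = """
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example product",
  "offers": {"price": "19.99", "priceCurrency": "EUR"}
}
"""

data = json.loads(raw)  # parse the whole document once

# Pick out only the fields of interest. dict.get() avoids a KeyError
# when a page omits a field.
offers = data.get("offers", {})
item = {
    "name": data.get("name"),
    "price": offers.get("price"),
    "currency": offers.get("priceCurrency"),
}
print(item)
```

In a spider, a dict like item could be what parse yields instead of the full parsed document.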

Jan · Accepted Answer · 2021-10-10 14:06:04Z

0

You could take the response as a string and use a recursive regex on it. Recursion is not supported by the standard-library re module, but it is supported by the newer third-party regex module.
That said, a possible approach could be:

import regex

# code before 

some_json_string = response.css('div.rich-snippet script').get()
match = regex.search(r'\{(?:[^{}]*|(?R))+\}', some_json_string)

if match:
    relevant_json = match.group(0)
    # process it further here

See a demo on regex101.com for the expression.
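If installing the third-party regex module is not an option, one alternative (a different technique from the recursive regex above) is the standard library's json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index and ignores whatever follows it. The helper name first_json_object and the sample string are hypothetical, chosen for illustration.

```python
import json


def first_json_object(text):
    """Return the first JSON object found in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find('{')
    while start != -1:
        try:
            # raw_decode returns (parsed_value, index_where_parsing_stopped).
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # This '{' did not start valid JSON; try the next one.
            start = text.find('{', start + 1)
    return None


# Hypothetical sample: note the '}' inside a string value.
sample = '<script type="application/ld+json">{"a": {"b": "}"}, "c": [1, 2]}</script>'
result = first_json_object(sample)
print(result)
```

Because the JSON parser itself decides where the object ends, a brace inside a string value does not confuse it, which is a case where pure brace counting can go wrong.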

Edit:

It seems that ::text is supported, so better use FlorianLudwig's answer instead.

answered Oct 10, 2021 at 14:06

Jan


1 Comment

zoommer Over a year ago

Thank you, I will use ::text, but your regex will come in handy as well :] I will try both ways just to see.
