Fetch page with Scrapy, execute JS and extract variable

Question

I have a project that uses the Python web-scraping framework Scrapy. I created a spider that loads all <script> tags and processes the second one, because in the test data I gathered, the data I need was in the second <script> tag.

But now I have a problem: on some pages the data I want is in a different script tag (#3 or #4). A further obstacle is that the JSON I want is usually on the second line of the second JavaScript tag, but depending on the page it could also be on the 3rd or 4th line.

Consider this simple HTML file:

<html>
    <head>
        <title> Test </title>
    </head>

    <body>
        <p>
            This is a text
        </p>

        <script type="text/javascript">
            var myJSON = {
                a: "a",
                b: 42
            }
        </script>
    </body>
</html>

If I open this page in my browser (Firefox), go to the developer tools, and run console.log(myJSON.b), I get 42. So my question is: how can I extract a JavaScript variable or JSON from a page fetched with Scrapy?

You could use Selenium to control a real web browser that can run JavaScript, or the outdated PhantomJS, or Splash, which even has a plugin for Scrapy: scrapy-splash.
– furas, Commented Oct 1, 2019 at 8:40
@furas I totally disagree. Selenium is above all a web-testing tool, not a web crawler, so it takes more time to load the page, and to no benefit, because there are plenty of ways to extract a JSON pattern with nothing more than Scrapy. By that I mean I exclude scrapy-splash too.
– AvyWam, Commented Oct 1, 2019 at 10:10
Duplicate of "How to extract data from javascript in a json format?"
– Georgiy, Commented Oct 1, 2019 at 15:05

Eb J · Accepted Answer · 2019-10-01 09:27:17Z

I ran into a similar issue before and solved it by extracting the text of the script tags with something like the following (based on your sample HTML file):

response.xpath('//script/text()')

After that I used a regular expression to extract the required data in JSON format. So, using the selector above and your sample HTML, something close to:

pattern = r'i-suck-at-regular-expressions'
json_data = response.xpath('//script/text()').re_first(pattern)
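
As a minimal sketch of what such a pattern could look like for the sample page, the code below searches every script body for `var myJSON = {...}`, so it does not matter which script tag or which line holds the assignment. The function name `find_js_object` and the pattern itself are illustrative assumptions, not part of Scrapy, and the non-greedy match assumes the object contains no nested braces. The plain list of strings stands in for what `response.xpath('//script/text()').getall()` would return in a spider.

```python
import re

# Hypothetical pattern for the sample page: captures the object literal
# assigned to `myJSON`. Assumes the object contains no nested braces.
PATTERN = re.compile(r'var\s+myJSON\s*=\s*(\{.*?\})', re.DOTALL)


def find_js_object(script_texts):
    """Return the first captured object literal in any script body, or None."""
    for text in script_texts:
        match = PATTERN.search(text)
        if match:
            return match.group(1)
    return None


# Stand-in for response.xpath('//script/text()').getall() in a spider.
scripts = [
    'console.log("unrelated");',
    '\n    var other = 1;\n    var myJSON = {\n        a: "a",\n        b: 42\n    }\n',
]
raw_object = find_js_object(scripts)
```

In a spider, the compiled pattern can typically be passed straight to `re_first` in place of the string pattern, which keeps the `re.DOTALL` flag needed for multi-line objects.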

Next, you should be able to use the json library to load the data as a Python dictionary, like so:

import json

json.loads(json_data)

And it should return something similar to:

{"a": "a", "b": 42}

