I have a project using the Python web-scraping framework Scrapy. I created a spider that loads all <script> tags and processes the second one, because in the test data I gathered, the data I need was in the second <script> tag.

But now I have a problem: on some pages the data I want is in a different script tag (the third or fourth). A further obstacle is that the JSON I want is usually on the second line of that script tag, but depending on the page it can also be on the third or fourth line.

Consider this simple HTML file:

<html>
    <head>
        <title> Test </title>
    </head>

    <body>
        <p>
            This is a text
        </p>

        <script type="text/javascript">
            var myJSON = {
                a: "a",
                b: 42
            }
        </script>
    </body>
</html>

If I open this page in my browser (Firefox), go to the developer tools and run console.log(myJSON.b), I get 42. So my question is: how can I extract a JavaScript variable or JSON from a page fetched by Scrapy?

  • You could use Selenium to control a real web browser that can run JavaScript, or the outdated PhantomJS, or Splash, which even has a plugin for Scrapy: scrapy-splash. Commented Oct 1, 2019 at 8:40
  • @furas I totally disagree. Selenium is above all a web-testing tool, not a web crawler, so it takes more time to load the page, and needlessly, because there are plenty of ways to extract a JSON pattern with nothing but Scrapy. By that I mean without scrapy-splash either. Commented Oct 1, 2019 at 10:10
  • Duplicate of "How to extract data from javascript in a json format?" Commented Oct 1, 2019 at 15:05

1 Answer

I ran into a similar issue before and solved it by extracting the text of the script tags with something like this (based on your sample HTML file):

response.xpath('//script/text()')

After that I used a regular expression to extract the required data in JSON format. So, using the selector above and your sample HTML, something close to:

pattern = r'i-suck-at-regular-expressions'
json_data = response.xpath('//script/text()').re_first(pattern)
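As a rough illustration of what such a pattern could look like, here is a minimal sketch that anchors on the variable name instead of on the position of the script tag or line. It uses only the standard library `re` module so it runs without Scrapy. The `html` string is a hypothetical page modeled on the sample, and the pattern assumes the object literal contains no nested braces.

```python
import re

# Hypothetical page modeled on the sample HTML; the data sits in the second
# <script> tag here, but the pattern does not depend on that position.
html = """
<script type="text/javascript">
    var other = 1;
</script>
<script type="text/javascript">
    var myJSON = {
        a: "a",
        b: 42
    }
</script>
"""

# (?s) lets "." match newlines; the non-greedy group stops at the first "}",
# so this assumes the object literal has no nested braces.
pattern = r'(?s)var\s+myJSON\s*=\s*(\{.*?\})'

match = re.search(pattern, html)
raw_object = match.group(1) if match else None
print(raw_object)
```

Because the pattern searches for `var myJSON =`, it should not matter whether the assignment is in the second, third or fourth script tag, or on which line it starts. The same pattern string should also be usable with `re_first` in the spider, though that is untested here.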

Next, you should be able to use the json library to load the data as a python dictionary like so:

import json
json.loads(json_data)

And it should return something similar to:

{"a": "a", "b": 42}