Scrapy: extract JSON from within HTML script

Question

I'm trying to extract (what appears to be) JSON data from within an HTML script. The HTML script looks like this on the site:

<script>
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
</script>

I'd like to pull out the following:

[{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]

Using the following code, I'm able to get the full script.

    def parse(self, response):
        print response.xpath('/html/body/script[2]').extract()

Is there a simple way to then extract the values for "id", "name", etc. from that script? Or is there a more direct way by modifying the XPath? I can't seem to go any deeper on the XPath using Firebug.

Would this help? stackoverflow.com/questions/13323976/…

– Ram K, Commented Jul 19, 2016 at 23:58

paul trmbrth · Accepted Answer · 2016-07-20 09:17:37Z

You can use js2xml for this.

To illustrate, first, let's create a Scrapy selector with your sample HTML, and grab the JavaScript statements:

>>> import scrapy
>>> sample = '''<script>
...   $(document).ready(function(){
...     var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
...     var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
...   });
... </script>'''
>>> selector = scrapy.Selector(text=sample, type='html')
>>> selector.xpath('//script//text()').extract_first()
u'\n  $(document).ready(function(){\n    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);\n    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});\n  });\n'

Then we can parse the JavaScript code with js2xml. You get an lxml tree back:

>>> import js2xml
>>> jssnippet = selector.xpath('//script//text()').extract_first()
>>> jstree = js2xml.parse(jssnippet)
>>> jstree
<Element program at 0x7fc7c6bae1b8>

What does the tree look like? It's pretty verbose:

>>> print(js2xml.pretty_print(jstree))
<program>
  <functioncall>
    <function>
      <dotaccessor>
        <object>
          <functioncall>
            <function>
              <identifier name="$"/>
            </function>
            <arguments>
              <identifier name="document"/>
            </arguments>
          </functioncall>
        </object>
        <property>
          <identifier name="ready"/>
        </property>
      </dotaccessor>
    </function>
    <arguments>
      <funcexpr>
        <identifier/>
        <parameters/>
        <body>
          <var name="terms">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Collections"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="Terms"/>
                </property>
              </dotaccessor>
              <arguments>
                <array>
                  <object>
                    <property name="id">
                      <string>6436</string>
                    </property>
                    <property name="name">
                      <string>SUMMER 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                  <object>
                    <property name="id">
                      <string>6517</string>
                    </property>
                    <property name="name">
                      <string>FALL 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                </array>
              </arguments>
            </new>
          </var>
          <var name="view">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Views"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="CourseSelector"/>
                </property>
              </dotaccessor>
              <arguments>
                <object>
                  <property name="el">
                    <string>body</string>
                  </property>
                  <property name="terms">
                    <identifier name="terms"/>
                  </property>
                </object>
              </arguments>
            </new>
          </var>
        </body>
      </funcexpr>
    </arguments>
  </functioncall>
</program>

You can use your XPath skills to point to the JavaScript array (you want the first argument of the new construct assigned to var terms):

>>> jstree.xpath('//var[@name="terms"]')
[<Element var at 0x7fc7c565e638>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')
[<Element array at 0x7fc7c565e5a8>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')[0]
<Element array at 0x7fc7c565e5a8>

Finally, now that you have the <array> element, you can pass it to js2xml.jsonlike.make_dict() to get a nice Python object to work with (make_dict is kinda misnamed, since here it returns a list):

>>> js2xml.jsonlike.make_dict(jstree.xpath('//var[@name="terms"]/new/arguments/*')[0])
[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}]
>>>

Note: you can also use the shortcut js2xml.jsonlike.getall() to fetch everything that looks like a Python dict or list (you get two items back, a list and a dict; you're interested in the first one):

>>> js2xml.jsonlike.getall(jstree)
[[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}], {'el': 'body', 'terms': 'terms'}]
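Once you have that list, getting at "id" and "name" is plain Python. A minimal sketch, assuming the list has the shape shown in the make_dict() output above; term_names_by_id is a hypothetical helper name, not part of js2xml or Scrapy:

```python
# Sample value copied from the make_dict() output above.
terms = [
    {'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'},
    {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'},
]


def term_names_by_id(terms):
    """Map each term's id to its name (hypothetical helper)."""
    return {term['id']: term['name'] for term in terms}


names = term_names_by_id(terms)
print(names)
```

In a spider callback you would more likely yield each dict (or an item built from it) instead of printing.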

Jerry An · Accepted Answer · 2020-12-13 05:56:19Z

1

chompjs provides an API to parse JavaScript objects into a dict.

For example, if the JavaScript code contains var data = {field: "value", secondField: "second value"}; you can extract that data as follows:

import chompjs
javascript = response.css('script::text').get()
data = chompjs.parse_js_object(javascript)

The final result is {'field': 'value', 'secondField': 'second value'}.
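When the embedded literal is strict JSON (quoted keys, as in the question's terms array), the standard library may be enough. The sketch below is a different technique from chompjs: it uses json.JSONDecoder.raw_decode, which parses one JSON value starting at a given offset and ignores whatever follows. SCRIPT is the question's sample, and first_json_after is a hypothetical helper name:

```python
import json

# Script text from the question, as returned by a '//script/text()' query.
SCRIPT = '''
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
'''


def first_json_after(script_text, marker):
    """Decode the first JSON array or object that follows `marker`."""
    start = script_text.index(marker) + len(marker)
    # Find the nearest '[' or '{' after the marker.
    offsets = [script_text.find(ch, start) for ch in '[{']
    offsets = [offset for offset in offsets if offset != -1]
    if not offsets:
        raise ValueError('no JSON literal found after marker')
    # raw_decode parses a single JSON value and stops; trailing
    # JavaScript such as ');' is left alone.
    value, _end = json.JSONDecoder().raw_decode(script_text, min(offsets))
    return value


terms = first_json_after(SCRIPT, 'var terms')
```

This would fail on the second statement's {el: "body", terms: terms}, because unquoted keys are not valid JSON; that is the kind of input a JavaScript-aware parser is for.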

Wilfredo · Accepted Answer · 2016-07-20 03:23:11Z

0

I would extract it using a regex, something like:

response.xpath('/html/body/script[2]').re_first(r'\((\[.*\])\)')
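The regex only gives you a string; json.loads turns it into Python objects. A minimal sketch using the standard library's re module in place of Scrapy's re_first, so it runs without a response object. SCRIPT is the question's sample and parse_terms is a hypothetical helper name:

```python
import json
import re

# Script text from the question.
SCRIPT = '''
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
'''

# Same pattern as the answer's. '.' does not match newlines by default,
# so the greedy '.*' stays on the line that holds the array.
ARRAY_IN_CALL = re.compile(r'\((\[.*\])\)')


def parse_terms(script_text):
    """Return the array passed as a call argument, or [] if none is found."""
    match = ARRAY_IN_CALL.search(script_text)
    if match is None:
        return []
    return json.loads(match.group(1))


terms = parse_terms(SCRIPT)
```

In a spider, the string returned by re_first can be passed to json.loads the same way, after checking that it is not None.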

2 Comments

ridingsolo Over a year ago

It works! After a bit more research I figured out a similar method to what you just posted. However, yours is much cleaner. Thank you.

Wilfredo Over a year ago

You are most welcome. Please feel free to mark this as answer, if it has solved your problem.

Blender · Accepted Answer · 2016-07-20 00:10:28Z

-1

You can't go "deeper" because that element's contents are just text. Assuming javascript holds that text, it's not too hard to read out the JSON from the JavaScript:

line = javascript.strip().splitlines()[1]
the_json = line.split('(', 1)[1].split(')', 1)[0]
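Wrapped in a function and followed by json.loads, the two lines above might look like the sketch below. It assumes the array sits on the second line of the script text and that no string value contains a ")" character; json_from_second_line is a hypothetical helper name:

```python
import json

# Script text from the question (contents of the <script> element).
SCRIPT = '''
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
'''


def json_from_second_line(javascript):
    """Parse the call argument found on the second line of the script."""
    # After strip(), line 0 is the $(document).ready(...) opener and
    # line 1 is the 'var terms = ...' statement.
    line = javascript.strip().splitlines()[1]
    # Keep what lies between the first '(' and the next ')'.
    the_json = line.split('(', 1)[1].split(')', 1)[0]
    return json.loads(the_json)


terms = json_from_second_line(SCRIPT)
```

The approach is tied to the page's exact layout, so a change in line order or a term name containing parentheses would break it.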
