3

I'm trying to extract (what appears to be) JSON data from within an HTML script. The HTML script looks like this on the site:

<script>
  $(document).ready(function(){
    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
  });
</script>

I'd like to pull out the following:

[{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]

Using the following code, I'm able to get the full script.

    def parse(self, response):
        print response.xpath('/html/body/script[2]').extract()

Is there a simple way to then extract the values for "id", "name", etc. from that script. Or, is there a more direct way by modifying the xpath? I can't seem to go any deeper on the xpath using firebug.

1

4 Answers 4

3

You can use js2xml for this.

To illustrate, first, let's create a Scrapy selector with your sample HTML, and grab the JavaScript statements:

>>> import scrapy
>>> sample = '''<script>
...   $(document).ready(function(){
...     var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);
...     var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});
...   });
... </script>'''
>>> selector = scrapy.Selector(text=sample, type='html')
>>> selector.xpath('//script//text()').extract_first()
u'\n  $(document).ready(function(){\n    var terms    = new Verba.Compare.Collections.Terms([{"id":"6436","name":"SUMMER 16","inquiry":true,"ordering":true},{"id":"6517","name":"FALL 16","inquiry":true,"ordering":true}]);\n    var view     = new Verba.Compare.Views.CourseSelector({el: "body", terms: terms});\n  });\n'

Then we can parse the JavaScript code with js2xml. You get an lxml tree back:

>>> import js2xml
>>> jssnippet = selector.xpath('//script//text()').extract_first()
>>> jstree = js2xml.parse(jssnippet)
>>> jstree
<Element program at 0x7fc7c6bae1b8>

What does the tree look like? It's pretty verbose:

>>> print(js2xml.pretty_print(jstree))
<program>
  <functioncall>
    <function>
      <dotaccessor>
        <object>
          <functioncall>
            <function>
              <identifier name="$"/>
            </function>
            <arguments>
              <identifier name="document"/>
            </arguments>
          </functioncall>
        </object>
        <property>
          <identifier name="ready"/>
        </property>
      </dotaccessor>
    </function>
    <arguments>
      <funcexpr>
        <identifier/>
        <parameters/>
        <body>
          <var name="terms">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Collections"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="Terms"/>
                </property>
              </dotaccessor>
              <arguments>
                <array>
                  <object>
                    <property name="id">
                      <string>6436</string>
                    </property>
                    <property name="name">
                      <string>SUMMER 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                  <object>
                    <property name="id">
                      <string>6517</string>
                    </property>
                    <property name="name">
                      <string>FALL 16</string>
                    </property>
                    <property name="inquiry">
                      <boolean>true</boolean>
                    </property>
                    <property name="ordering">
                      <boolean>true</boolean>
                    </property>
                  </object>
                </array>
              </arguments>
            </new>
          </var>
          <var name="view">
            <new>
              <dotaccessor>
                <object>
                  <dotaccessor>
                    <object>
                      <dotaccessor>
                        <object>
                          <identifier name="Verba"/>
                        </object>
                        <property>
                          <identifier name="Compare"/>
                        </property>
                      </dotaccessor>
                    </object>
                    <property>
                      <identifier name="Views"/>
                    </property>
                  </dotaccessor>
                </object>
                <property>
                  <identifier name="CourseSelector"/>
                </property>
              </dotaccessor>
              <arguments>
                <object>
                  <property name="el">
                    <string>body</string>
                  </property>
                  <property name="terms">
                    <identifier name="terms"/>
                  </property>
                </object>
              </arguments>
            </new>
          </var>
        </body>
      </funcexpr>
    </arguments>
  </functioncall>
</program>

You can use your XPath skills to point to the JavaScript array (you want the 1st argument of the "dot" accessor for the new contruct assigned to var terms):

>>> jstree.xpath('//var[@name="terms"]')
[<Element var at 0x7fc7c565e638>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')
[<Element array at 0x7fc7c565e5a8>]
>>> jstree.xpath('//var[@name="terms"]/new/arguments/*')[0]
<Element array at 0x7fc7c565e5a8>

Finally, now that you have the <array> element, you can pass it to js2xml.jsonlike.make_dict() to get a nice Python object to work with (make_dict is kinda misnamed):

>>> js2xml.jsonlike.make_dict(jstree.xpath('//var[@name="terms"]/new/arguments/*')[0])
[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}]
>>> 

Note: you can also use the shortcut js2xml.jsonlike.getall() to fetch everything that looks like a Python dict or list (you get 2 lists, you're interested in the 1st one):

>>> js2xml.jsonlike.getall(jstree)
[[{'ordering': True, 'inquiry': True, 'id': '6436', 'name': 'SUMMER 16'}, {'ordering': True, 'inquiry': True, 'id': '6517', 'name': 'FALL 16'}], {'el': 'body', 'terms': 'terms'}]
Sign up to request clarification or add additional context in comments.

Comments

1

chompjs provides an API to parse JavaScript objects into a dict.

For example, if the JavaScript code contains var data = {field: "value", secondField: "second value"}; you can extract that data as follows:

import chompjs
javascript = response.css('script::text').get()
data = chompjs.parse_js_object(javascript)

The final result is {'field': 'value', 'secondField': 'second value'} a

Comments

0

I would extract it using a regex, something like:

response.xpath('/html/body/script[2]').re_first('\((\[.*\])\)')

2 Comments

It works! After a bit more research I figured out a similar method to what you just posted. However, yours is much cleaner. Thank you.
You are most welcome. Please feel free to mark this as answer, if it has solved your problem.
-1

You can't go "deeper" because that element's contents are just text. It's not too hard to read out the JSON from the JavaScript:

line = javascript.strip().splitlines()[1]
the_json = line.split('(', 1)[1].split(')', 1)[0]

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.