Using Scrapy, how do I get the value of a JavaScript variable?

Here is my code:

<script rel="bmc-data">
      var match = 'yes';
      var country = 'uk';
      var tmData = {
        "googleExperimentVariation": "1",
        "pageTitle": "Child Care",
        "page_type": "claimed",
        "company_state": "wyostate",
        "company_city": "mycity"
                   };
</script>

I want to check the value of the page_type property of tmData. If it is "claimed", process the page; otherwise, move on.

I have already seen this and this

I have tried this ...

pattern = r'page_type = "(\w+)",'
response.xpath('//script[@rel="bmc-data"]').re(pattern)

but of course this is not working, because I think my regex is wrong.

2 Answers

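Your regex pattern is faulty: the page source contains "page_type": "claimed" (a quoted key followed by a colon), not page_type = "claimed". Match the text as it actually appears: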

import re

# html_body is the page source as a string; it contains this bit: "page_type": "claimed",
re.findall(r'page_type": "([^"]+)"', html_body)
# ['claimed']

Or, using Scrapy selectors in your case:

response.xpath('//script[@rel="bmc-data"]').re(r'page_type": "([^"]+)"')

If you need to parse more than one variable like this, I recommend the answer by Paul, since regex is not always as reliable as parsing the JavaScript into XML.
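
To make the difference between the two patterns concrete, here is a minimal standalone sketch using only the standard library. The html_body string is a shortened copy of the script from the question, and the names PAGE_TYPE_RE, broken and page_type are illustrative, not part of Scrapy. It allows optional whitespace around the colon, which is an assumption about how the page might be formatted.

```python
import re

# Shortened copy of the <script> block from the question (assumed page source).
html_body = '''<script rel="bmc-data">
      var tmData = {
        "pageTitle": "Child Care",
        "page_type": "claimed",
        "company_state": "wyostate"
                   };
</script>'''

# The question's pattern expects `page_type = "..."`, which never occurs in the source.
broken = re.findall(r'page_type = "(\w+)",', html_body)

# Match the quoted key, a colon with optional whitespace, then capture up to the next quote.
PAGE_TYPE_RE = re.compile(r'"page_type"\s*:\s*"([^"]*)"')
match = PAGE_TYPE_RE.search(html_body)
page_type = match.group(1) if match else None

print(broken, page_type)
```

Checking page_type == "claimed" then decides whether to process the page. Capturing with [^"]* rather than .+ keeps the match from running past the closing quote if the script is ever minified onto one line.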

2 Comments

Thanks. I got this error when I tried your solution: AttributeError: 'function' object has no attribute 'findall'
@PuneetSharma It seems like you have some syntax issues; see my edit for a concrete example for your case.

I suggest using js2xml for this (disclaimer: I wrote js2xml):

>>> import scrapy
>>> import js2xml
>>> html = '''<script rel="bmc-data">
...       var match = 'yes';
...       var country = 'uk';
...       var tmData = {
...         "googleExperimentVariation": "1",
...         "pageTitle": "Child Care",
...         "page_type": "claimed",
...         "company_state": "wyostate",
...         "company_city": "mycity"
...                    };
... </script>'''
>>> selector = scrapy.Selector(text=html)
>>> selector.xpath('//script/text()').extract_first()
u'\n      var match = \'yes\';\n      var country = \'uk\';\n      var tmData = {\n        "googleExperimentVariation": "1",\n        "pageTitle": "Child Care",\n        "page_type": "claimed",\n        "company_state": "wyostate",\n        "company_city": "mycity"\n                   };\n'
>>> jscode = selector.xpath('//script/text()').extract_first()
>>> jstree = js2xml.parse(jscode)
>>> print(js2xml.pretty_print(jstree))
<program>
  <var name="match">
    <string>yes</string>
  </var>
  <var name="country">
    <string>uk</string>
  </var>
  <var name="tmData">
    <object>
      <property name="googleExperimentVariation">
        <string>1</string>
      </property>
      <property name="pageTitle">
        <string>Child Care</string>
      </property>
      <property name="page_type">
        <string>claimed</string>
      </property>
      <property name="company_state">
        <string>wyostate</string>
      </property>
      <property name="company_city">
        <string>mycity</string>
      </property>
    </object>
  </var>
</program>

>>> jstree.xpath('//var[@name="tmData"]/object')[0]
<Element object at 0x7f0b0018f050>

>>> from pprint import pprint
>>> data = js2xml.jsonlike.make_dict(jstree.xpath('//var[@name="tmData"]/object')[0])
>>> pprint(data)
{'company_city': 'mycity',
 'company_state': 'wyostate',
 'googleExperimentVariation': '1',
 'pageTitle': 'Child Care',
 'page_type': 'claimed'}
>>> data['page_type']
'claimed'
>>> 
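
If adding a dependency is not an option, a standard-library alternative can work for this particular page, because the tmData object literal happens to be valid JSON. The sketch below is not the js2xml approach: it swaps in a regex plus json.loads, and the helper name extract_tm_data is hypothetical. It assumes the object contains no nested "};" sequence and uses double-quoted keys and values throughout.

```python
import json
import re

# Text of the <script> element from the question (assumed to be already extracted).
script_text = '''
      var match = 'yes';
      var country = 'uk';
      var tmData = {
        "googleExperimentVariation": "1",
        "pageTitle": "Child Care",
        "page_type": "claimed",
        "company_state": "wyostate",
        "company_city": "mycity"
                   };
'''


def extract_tm_data(js_source):
    """Return the tmData object as a dict, or None if it is missing or not valid JSON."""
    # Capture from the opening brace to the first closing brace followed by a semicolon.
    found = re.search(r'var\s+tmData\s*=\s*(\{.*?\})\s*;', js_source, re.DOTALL)
    if found is None:
        return None
    try:
        return json.loads(found.group(1))
    except ValueError:
        # Object literals with unquoted keys or single quotes are not JSON.
        return None


data = extract_tm_data(script_text)
if data is not None and data.get("page_type") == "claimed":
    print("process this page")
```

For object literals that are not strict JSON (unquoted keys, single quotes, trailing commas), this returns None, and a real JavaScript parser such as js2xml is the safer choice.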

3 Comments

Thanks for your reply, Paul. Using a library for this simple operation seems like a bit of overkill.
It depends on the use case, obviously. Personally, I prefer to avoid writing regex when I can. A matter of taste, maybe.
There seems to be a line missing. It should have jstree = js2xml.parse(jscode)
