Scraping from javascript using Scrapy

Question

I need to scrape the content of a JavaScript <script> tag using Scrapy. The tag looks like this:

<script type='text/javascript' id='script-id'> attribute={"pid":"123","title":"abc","url":"http://example.com","date":"2014-07-31 14:56:39 CDT","channels":["test"],"tags":[],"authors":["james Catcher"]};</script>

I can extract the content using XPath:

response.xpath('id("script-id")//text()').extract()

Output

[u'\nattribute = {"pid":"123","title":"abc","url":"http:/example.com","date":"2014-07-30 15:34:10 ","channels":["test"],"tags":[],"authors":["james Watt"]};\n(function( ){\n var s = document.createElement(\'script\');\n s.async = true;\n s.type = \'text/javascript\';\n s.src = document.location.protocol + \'//d8rk54i4mohrb. cloudfront.net/js/reach.js\';\n (document.getElementsByTagName(\'head\')[0] || document.getElementsByTagName(\'body\')[0]).appendChild(s);\n})();\n']

How can I get each value using XPath?

@Artjom B. I have now edited the question. How can I get the values of pid, title, etc.?
– Anish, Commented Aug 1, 2014 at 10:58

Arthur Burkhardt · Accepted Answer · 2014-08-01 11:20:03Z

2

This is JSON, so you can first extract it from the string and then load it with the json module:

In [1]: import json

In [2]: sample_string = [u'\n attribute={"pid":"123","title":"abc",'
        +'"url":"http:/example.com","date":"2014-07-30 15:34:10 ",'
        +'"channels":["test"],"tags":[],"authors":["james Watt"]}'][0]

In [3]: data = json.loads(sample_string[12:])

In [4]: data
Out[4]:
{u'authors': [u'james Watt'],
u'channels': [u'test'],
u'date': u'2014-07-30 15:34:10 ',
u'pid': u'123',
u'tags': [],
u'title': u'abc',
u'url': u'http:/example.com'}

In [5]: data['authors']
Out[5]: [u'james Watt']

Alternatively, you can load a JavaScript engine such as PyV8 to interpret those variables.
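
The full script text in the question continues with more JavaScript after the object literal, so a fixed slice followed by json.loads would fail on the trailing code. Here is a minimal sketch of one way around that, assuming Python 3 and assuming the object literal is valid JSON (double-quoted keys and strings). The helper name extract_js_object and its var_name parameter are made up for illustration; json.JSONDecoder.raw_decode is the standard-library call that parses one value and reports where it stopped.

```python
import json
import re


def extract_js_object(script_text, var_name="attribute"):
    """Return the object literal assigned to var_name as a Python dict.

    Hypothetical helper, not part of Scrapy or the original answer.
    """
    # Locate "attribute=" or "attribute = " and skip past the equals sign.
    match = re.search(r"\b" + re.escape(var_name) + r"\s*=\s*", script_text)
    if match is None:
        raise ValueError("assignment to %r not found" % var_name)
    # raw_decode parses a single JSON value starting at the given index
    # and ignores whatever follows it (here, the rest of the script).
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


# Shortened version of the script text from the question.
script_text = (
    '\nattribute = {"pid":"123","title":"abc","channels":["test"],'
    '"tags":[],"authors":["james Watt"]};\n'
    "(function( ){\n var s = document.createElement('script');\n})();\n"
)

data = extract_js_object(script_text)
print(data["pid"], data["authors"])
```

In a spider callback, script_text would be the first element of the list returned by the XPath query in the question.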

edited Aug 1, 2014 at 11:20

answered Aug 1, 2014 at 7:29

Arthur Burkhardt

3 Comments

Anish Over a year ago

The output also contains the following: [u'\nscript-id = {"pid":"123","title":"abc","url":"http:/example.com","date":"2014-07-30 15:34:10 ","channels":["test"],"tags":[],"authors":["james Watt"]}];\n(function( ){\n var s = document.createElement(\'script\');\n s.async = true;\n s.type = \'text/javascript\';\n s.src = document.location.protocol + \'//d8rk54i4mohrb. cloudfront.net/js/reach.js\';\n (document.getElementsByTagName(\'head\')[0] || document.getElementsByTagName(\'body\')[0]).appendChild(s);\n})();\n']

Artjom B. Over a year ago

I think it is better to use output.split("attribute=")[1] instead of slicing ([13:]). Or even better, output.split("=")[1].

Arthur Burkhardt Over a year ago

Yep, completely fine as well. Whatever works based on the type of strings encountered.
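
One caveat with the split suggested in the comments: split("=")[1] cuts the text at every "=", so a value that itself contains "=" gets truncated. The sketch below assumes Python 3 and a script that holds only the assignment; the sample string with a query-string URL is made up for illustration. It uses str.partition, which splits on the first "=" only.

```python
import json

# Hypothetical sample: the URL carries a query string, so it contains "=".
text = '\nattribute={"pid":"123","url":"http://example.com/?id=7"};\n'

# split("=")[1] stops at the second "=", leaving a truncated fragment.
truncated = text.split("=")[1]

# partition splits on the first "=" only and keeps the rest intact.
_name, _sep, rest = text.partition("=")

# Drop surrounding whitespace and the trailing semicolon before parsing.
payload = rest.strip().rstrip(";")
data = json.loads(payload)
print(data["url"])
```

text.split("=", 1)[1] gives the same remainder as partition here; either works as long as nothing follows the semicolon.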
