
Some pages do not return raw data (JSON, XML, or HTML) from their AJAX calls. Instead, they use a framework like Dojo, where the AJAX calls return JavaScript files that somehow populate the HTML nodes.

I am wondering if there is a non-Selenium strategy for scraping data from these pages.

  • As far as I know you need a browser; if you do not want Selenium, try PhantomJS or similar. Have you tried jeanphix.me/Ghost.py? (See the sketch after these comments.) Commented Dec 12, 2014 at 13:31
  • @gosom PhantomJS is pretty cool and works well in my current implementation locally. The downside is that it is very slow. I also had problems deploying the code to Heroku, so I'm wondering if there's something better. Commented Dec 12, 2014 at 13:35
  • Maybe try this: codeproject.com/Articles/528293/… Commented Dec 12, 2014 at 13:39
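
For reference, a minimal sketch of the Ghost.py route suggested in the comments above, assuming the pre-1.0 Ghost.py API that was current at the time; the URL and CSS selector are placeholders:

    # Sketch only: assumes the pre-1.0 Ghost.py API (open, wait_for_selector,
    # content) from around 2014; the URL and selector below are made up.
    from ghost import Ghost

    ghost = Ghost()

    # open() drives a headless WebKit page, so the scripts returned by the
    # ajax calls actually execute and populate the DOM
    page, extra_resources = ghost.open('http://example.com/dojo-page')
    assert page.http_status == 200

    # block until the node the javascript builds shows up
    ghost.wait_for_selector('#results')

    # the rendered HTML after script execution; feed it to any HTML parser
    rendered_html = ghost.content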

1 Answer


As an alternative to the Selenium- or WebKit-based approach, you can parse the JavaScript itself with a JavaScript code parser, like slimit. This definitely raises the complexity and reduces the reliability of the scraping, since you are working close to the bare metal: think of it as a "white box" approach, as opposed to the high-level, Selenium-based "black box" one.

Here's the answer I've given for the exact same topic/problem you are asking about:

It involves using slimit to grab an object out of the JavaScript code, loading it into a Python data structure via the json module, and parsing the HTML inside with BeautifulSoup.
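
To make that pipeline concrete, here is a minimal, self-contained sketch. The JS snippet and the variable name payload are hypothetical stand-ins for whatever the ajax call actually returns, but the slimit/json/BeautifulSoup calls follow the pattern described above:

    import json

    from bs4 import BeautifulSoup
    from slimit import ast
    from slimit.parser import Parser
    from slimit.visitors import nodevisitor

    # Hypothetical body of a JS file returned by an ajax call: a variable
    # holding an object literal with an HTML fragment inside.
    js_code = 'var payload = {"html": "<ul><li>one</li><li>two</li></ul>"};'

    # Walk the javascript AST looking for the declaration we care about
    data = None
    for node in nodevisitor.visit(Parser().parse(js_code)):
        if isinstance(node, ast.VarDecl) and node.identifier.value == 'payload':
            # to_ecma() re-serializes the object literal back to source
            # text, which here happens to be valid JSON
            data = json.loads(node.initializer.to_ecma())
            break

    # Parse the embedded HTML fragment the javascript would have injected
    soup = BeautifulSoup(data['html'], 'html.parser')
    print([li.get_text() for li in soup.find_all('li')])  # ['one', 'two']

Inside a Scrapy callback, the response body of the intercepted ajax request would play the role of js_code here.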
