How to parse html that includes javascript code

Question

How does one parse HTML documents that make heavy use of JavaScript? I know there are a few Python libraries that can parse static XML/HTML files. I'm looking for a programme or library (or even a Firefox plugin) that reads HTML+JavaScript, executes the JavaScript, and outputs HTML without JavaScript that would look identical if displayed in a browser.

As a simple example

<a href="javascript:web_link(34, true);">link</a>

should be replaced by the appropriate value the javascript function returns, e.g.

<a href="http://www.example.com">link</a>

A more complex example would be a saved Facebook HTML page, which is littered with loads of JavaScript code.

Probably related to "How to "execute" HTML+Javascript page with Node.js", but do I really need Node.js and JSDOM? Also slightly related is "Python library for rendering HTML and javascript", but I'm not interested in rendering, just the pure HTML output.

Either get a JavaScript runtime and sort something out with it, or analyse the code and work out what it's going to end up as (strongly per-site configuration).
– Chris Morgan, commented Aug 17, 2011 at 14:46
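
The second option in that comment (analyse the code per site) can be sketched for the simple example from the question. This is only an illustration: the `WEB_LINK_TARGETS` table and the `rewrite_links` helper are hypothetical, and the mapping from id 34 to a URL has to be worked out by hand for each site. No JavaScript is executed here.

```python
import html
import re

# Hypothetical per-site table: what web_link(<id>, ...) resolves to.
# These values must be worked out by reading the site's JavaScript.
WEB_LINK_TARGETS = {34: "http://www.example.com"}

# Matches href="javascript:web_link(34, true);" and captures the numeric id.
HREF_RE = re.compile(r'href="javascript:web_link\((\d+),\s*(?:true|false)\);?"')


def rewrite_links(page, targets=WEB_LINK_TARGETS):
    """Replace known javascript:web_link(...) hrefs with plain URLs."""
    def replace(match):
        url = targets.get(int(match.group(1)))
        if url is None:
            return match.group(0)  # unknown id: leave the link untouched
        return 'href="%s"' % html.escape(url, quote=True)
    return HREF_RE.sub(replace, page)
```

This only works while the site keeps the same function name and call shape, which is what "strongly per-site configuration" implies.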

PabloG · Accepted Answer · 2011-08-22 20:15:20Z

You can use Selenium with Python, as detailed here.

Example:

import os
import xmlrpc.client  # Python 3 name of the old xmlrpclib module

# Make an object to represent the XML-RPC server.
server_url = "http://localhost:8080/selenium-driver/RPC2"
app = xmlrpc.client.ServerProxy(server_url)

# Bump timeout a little higher than the default 5 seconds
app.setTimeout(15)

# Launch the browser through a Windows batch file ('start' is a cmd.exe built-in)
os.system('start run_firefox.bat')

# Each call is forwarded to the Selenium server; print the result it returns
print(app.open('http://localhost:8080/AUT/000000A/http/www.amazon.com/'))
print(app.verifyTitle('Amazon.com: Welcome'))
print(app.verifySelected('url', 'All Products'))
print(app.select('url', 'Books'))
print(app.verifySelected('url', 'Books'))
print(app.verifyValue('field-keywords', ''))
print(app.type('field-keywords', 'Python Cookbook'))
print(app.clickAndWait('Go'))
print(app.verifyTitle('Amazon.com: Books Search Results: Python Cookbook'))
print(app.verifyTextPresent('Python Cookbook', ''))
print(app.verifyTextPresent('Alex Martellibot, David Ascher', ''))
print(app.testComplete())
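
The example above drives a browser through an old XML-RPC interface but does not show how to get the HTML back. As a rough sketch of what the question asks for, assuming the current Selenium WebDriver bindings (`pip install selenium`) and a local Firefox with geckodriver: load the page, read `page_source` after the scripts have run, then drop the `<script>` elements. `render_page`, `ScriptStripper` and `strip_scripts` are illustrative names, not part of Selenium, and inline handlers such as `onclick` are left in place.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and their contents."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "script" and not self.in_script:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.in_script:
            self.parts.append("&#%s;" % name)

    def handle_comment(self, data):
        self.parts.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.parts.append("<!%s>" % decl)


def strip_scripts(page):
    """Return the markup with all <script> elements removed."""
    parser = ScriptStripper()
    parser.feed(page)
    parser.close()
    return "".join(parser.parts)


def render_page(url):
    """Load url in Firefox and return the page source after scripts ran."""
    from selenium import webdriver  # third-party; imported lazily
    driver = webdriver.Firefox()
    try:
        driver.get(url)  # returns once the page's load event has fired
        return driver.page_source
    finally:
        driver.quit()
```

Calling `strip_scripts(render_page(url))` would then give script-free markup. Pages that keep modifying the DOM after the load event would probably need an explicit wait before `page_source` is read.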

Community · Accepted Answer · 2020-06-20 09:12:55Z

From Mozilla Gecko FAQ:

Q. Can you invoke the Gecko engine from a Unix shell script? Could you send it HTML and get back a web page that might be sent to the printer?

A. Not really supported; you can probably get something close to what you want by writing your own application using Gecko's embedding APIs, though. Note that it's currently not possible to print without a widget on the screen to render to.

Embedding Gecko in a program that outputs what you want may be way too heavy, but at least your output will be as good as it gets.

answered Aug 17, 2011 at 8:27 by Jonas G. Drange; edited Jun 20, 2020 at 9:12 by Community Bot

Could also add this recipe: siliconforks.com/doc/parsing-javascript-with-spidermonkey
– Jonas G. Drange, commented over a year ago

gliptak · Accepted Answer · 2013-10-31 01:34:16Z

PhantomJS can be loaded using Selenium:

$ ipython

In [1]: from selenium import webdriver

In [2]: browser=webdriver.PhantomJS()

In [3]: browser.get('http://seleniumhq.org/')

In [4]: browser.title
Out[4]: u'Selenium - Web Browser Automation'
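
As far as I know, PhantomJS is no longer maintained and Selenium 4 no longer ships `webdriver.PhantomJS`, so the sketch below uses headless Firefox instead. That substitution is mine, not the answer's. It also shows one possible way to handle the `javascript:` links from the question: ask the browser to evaluate each call and write the returned string back into `href`. It assumes each link holds a single call that returns the URL, as the question describes; `js_expression`, `resolve_javascript_links` and `make_headless_firefox` are hypothetical helper names.

```python
# JavaScript that collects every <a> whose href is a javascript: URL,
# paired with the raw attribute value.
COLLECT_SCRIPT = """
return Array.from(document.querySelectorAll('a[href^="javascript:"]'))
    .map(function (a) { return [a, a.getAttribute('href')]; });
"""


def js_expression(href):
    """Turn 'javascript:web_link(34, true);' into 'web_link(34, true)'."""
    prefix = "javascript:"
    if not href.lower().startswith(prefix):
        return None
    return href[len(prefix):].strip().rstrip(";").strip()


def resolve_javascript_links(driver):
    """Replace javascript: hrefs with the string each call returns.

    Returns the number of links rewritten. Calls that do not return a
    string are left untouched.
    """
    rewritten = 0
    for element, href in driver.execute_script(COLLECT_SCRIPT):
        expression = js_expression(href)
        if not expression:
            continue
        # Assumption: the expression is a single call that returns a URL.
        result = driver.execute_script("return (" + expression + ");")
        if isinstance(result, str):
            driver.execute_script(
                "arguments[0].setAttribute('href', arguments[1]);",
                element, result)
            rewritten += 1
    return rewritten


def make_headless_firefox():
    """Start Firefox without a window (needs selenium and geckodriver)."""
    from selenium import webdriver  # third-party; imported lazily
    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    return webdriver.Firefox(options=options)
```

A typical run would create the driver with `make_headless_firefox()`, call `driver.get(url)`, then `resolve_javascript_links(driver)`, read `driver.page_source`, and finish with `driver.quit()`. Evaluating page functions this way can have side effects if they do more than return a value, so it is worth trying on a saved copy first.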

