I want to convert some web pages that use JavaScript to plain HTML, and I found several ways to do it (please tell me if I'm wrong):

  1. Use Jython, an example: http://blog.databigbang.com/web-scraping-ajax-and-javascript-sites/
  2. Use Java together with HtmlUnit
  3. Use a proxy, an example: http://grep.codeconsult.ch/2007/02/24/crowbar-scrape-javascript-generated-pages-via-gecko-and-rest/
  4. Use Python together with Qt or PyV8

I only want to build a small tool, and although Python is my first choice, installing V8 and Qt seems somewhat complicated.

So I tried to make a proxy with Gecko, but it seems to need a DISPLAY, which I don't have on a remote Linux server.

Now I am trying Jython, but there seems to be no simple way to just convert a whole page to plain HTML.

What I actually want to ask is: is there a way to convert a web page that contains JavaScript to plain HTML, just as the browser does? Can Node.js do this job?

  • Render it with Selenium/Ghost.py and dump the DOM into an HTML file. Commented Oct 21, 2013 at 3:18
  • What are you trying to accomplish, out of curiosity? Commented Oct 21, 2013 at 3:18
  • yeah, that... do you want to remove all javascript from a page? that can be done easily with a regular expression... Commented Oct 21, 2013 at 3:19
  • @JoshuaSmock Just trying to get the content generated by javascript Commented Oct 21, 2013 at 6:57
  • @NicolásStraubValdivieso I am trying to extract the content generated by js, so can not just remove them. Commented Oct 21, 2013 at 6:59

1 Answer

I've recently built a server on top of PhantomJS that does this. I highly recommend this route.

http://phantomjs.org/

Basically, you write a quick script that has PhantomJS run the page, and configure a trigger method that lets you know the page is finished and sends the data off. My version used the built-in HTTP server, so PhantomJS easily served up the results on its own. This takes about 15 lines of code to do. (Sorry, can't paste it here... wrote it on work time. But, check out the example on their home page. It's almost complete!)
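Since the original script could not be shared, here is a minimal sketch of the approach described above, not the answerer's actual code. It assumes a hypothetical file name `render.js`, a hypothetical helper `waitFor`, and a simple "finished" trigger (`document.readyState === 'complete'`) with an arbitrary 5-second timeout; a real page will probably need a page-specific check, such as waiting for a particular element to appear. It prints the rendered HTML to stdout rather than serving it over PhantomJS's built-in HTTP server.

```javascript
// render.js (hypothetical name): run as `phantomjs render.js <url>`.
// Written in ES5 because PhantomJS does not support newer syntax.

// Hypothetical helper: poll `isReady` every `intervalMs` until it returns
// true or `timeoutMs` elapses, then call `done(ready)` exactly once.
function waitFor(isReady, done, timeoutMs, intervalMs) {
    var start = Date.now();
    var timer = setInterval(function () {
        var ready = Boolean(isReady());
        if (ready || Date.now() - start >= timeoutMs) {
            clearInterval(timer);
            done(ready);
        }
    }, intervalMs);
}

// Only touch the PhantomJS API when actually running inside PhantomJS.
if (typeof phantom !== 'undefined') {
    var system = require('system');
    var page = require('webpage').create();
    var url = system.args[1];

    page.open(url, function (status) {
        if (status !== 'success') {
            system.stderr.writeLine('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        waitFor(function () {
            // Assumption: the page counts as finished once the document
            // reports that it has fully loaded.
            return page.evaluate(function () {
                return document.readyState === 'complete';
            });
        }, function () {
            // page.content is the serialized DOM after scripts have run.
            // It is printed even on timeout, as a best-effort result.
            console.log(page.content);
            phantom.exit(0);
        }, 5000, 100);
    });
}
```

Assuming PhantomJS is installed, something like `phantomjs render.js http://example.com/ > page.html` should write the post-JavaScript markup to `page.html`. PhantomJS runs headless, so it does not need a DISPLAY on a remote server.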

2 Comments

Thanks, PhantomJS resolves my problem.
Any chance of bringing phantomjs.org online again?