JavaScript-aware HTML parser for Python

Question

<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
</head>
<body>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
<div><a href="http://www.google.com">f*** js</a></div>
</body>
</html>

I want to use XPath to find all the <a> tags in the HTML page above.

In [1]: import lxml.html as H

In [2]: f = open("test.html","r")

In [3]: c = f.read()

In [4]: doc = H.document_fromstring(c)

In [5]: doc.xpath('//a')
Out[5]: [<Element a at a01d17c>]

In [6]: a = doc.xpath('//a')[0]

In [7]: a.getparent()
Out[7]: <Element div at a01d41c>

I only get the one that is not generated by JavaScript, but the XPath Checker add-on in Firefox finds all of them.

https://i.sstatic.net/0hSug.png

How can I do the same from Python? Thanks!

<html>
<head>
</head>
<body>
<script language="javascript">
function over(){
a.innerHTML="mouse me"
}
function out(){
a.innerHTML="<a href='http://www.google.com'>google</a>"
}
</script>
<li id="a" onmouseover="over()" onmouseout="out()">mouse me</li>
</body>
</html>

You will have to parse and interpret the JavaScript before parsing the HTML. Have you seen crummy.com/software/BeautifulSoup?
– Paulo Scardine, Commented Dec 28, 2010 at 1:10
Your JavaScript, as given, makes no sense: you are writing links into the document's head? From lxml's point of view, anything in document.write is a string constant, not to be parsed.
– Hugh Bothwell, Commented Dec 28, 2010 at 1:24
BTW, document.write() is not allowed in XML documents. You must use the DOM API.
– Keith, Commented Dec 28, 2010 at 1:26
I think the title should be "javascript-aware html parser for Python".
– redShadow, Commented Dec 28, 2010 at 1:45
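
Hugh Bothwell's point, that a static parser treats the argument of document.write as plain text, can be checked with a small sketch. It uses the standard library's html.parser instead of lxml (an assumption made only so the example has no third-party dependency), and PAGE and TagAndScriptCollector are illustrative names that do not come from the original post.

```python
from html.parser import HTMLParser

# A cut-down, hypothetical version of the page from the question.
PAGE = """<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">link from js</a>');
    </script>
</head>
<body>
<div><a href="http://www.google.com">static link</a></div>
</body>
</html>"""


class TagAndScriptCollector(HTMLParser):
    """Record every <a> start tag and the raw text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.anchors = []      # href values of <a> tags the parser reported
        self.script_text = []  # raw script bodies, never parsed as markup
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            self.anchors.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content arrives here as plain character data.
        if self._in_script:
            self.script_text.append(data)


collector = TagAndScriptCollector()
collector.feed(PAGE)
collector.close()

print(collector.anchors)
print("".join(collector.script_text).strip())
```

Only the static link inside the div is reported as an element; the link written by the script shows up solely as text in the script body, which matches what the lxml session in the question returned.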

Community · Accepted Answer · 2017-05-23 12:11:50Z

1

I don't know of a JavaScript-aware parser in Python, but you can use ANTLR to do the job. The idea is not mine, so I'm leaving you the link.

It's actually quite cool, because you can tune your parser to choose selectively which instructions need to be parsed (and executed).
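
One rough way to act on the idea of handling only selected instructions is sketched below. It swaps ANTLR for a regular expression (a deliberate simplification, not the linked approach): the string literals passed to document.write are pulled out of each script and parsed as HTML. The sketch assumes every call takes a single plain literal with no escaped quotes or concatenation, and PAGE, LinkCollector, collect and find_links are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Matches document.write('...') or document.write("...") with one plain literal.
WRITE_CALL = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""", re.S)

# Hypothetical page modelled on the one in the question.
PAGE = """<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">from head</a>');
    document.write("plain text!");
    </script>
</head>
<body>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">from body</a>');
    </script>
<div><a href="http://www.google.com">static</a></div>
</body>
</html>"""


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags and the raw bodies of <script> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def collect(markup):
    """Parse a page or fragment and return the filled collector."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser


def find_links(page):
    """Return (static_hrefs, written_hrefs) for one HTML page."""
    static = collect(page)
    written = []
    for script in static.scripts:
        for match in WRITE_CALL.finditer(script):
            # group(2) is the literal's content; parse it as an HTML fragment.
            written.extend(collect(match.group(2)).hrefs)
    return static.hrefs, written


static_hrefs, written_hrefs = find_links(PAGE)
print(static_hrefs)
print(written_hrefs)
```

Anything built from variables, innerHTML assignments or event handlers, as in the second page of the question, is out of reach for this approach and needs a real JavaScript engine with a DOM.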

edited May 23, 2017 at 12:11

CommunityBot


answered Dec 28, 2010 at 2:03

dierre


1 Comment

redShadow Over a year ago

Nice! And from the same question you linked, pypi.python.org/pypi/python-spidermonkey seems worth considering as well.

Paulo Scardine · Accepted Answer · 2010-12-28 01:26:56Z

0

In Java there is Cobra. I don't know of any JavaScript-aware HTML parser for Python.

answered Dec 28, 2010 at 1:26

Paulo Scardine


redShadow · Accepted Answer · 2010-12-28 02:05:28Z

0

Searching Google for "javascript standalone runtime", I found jslibs: a "standalone JavaScript development runtime environment for using JavaScript as a general-purpose scripting language", based on "SpiderMonkey library that is Gecko's JavaScript engine".

Sounds great! I haven't tested it yet, but it seems like this will allow you to run the JavaScript code you find in the page. I don't know how tricky it will be, though.

answered Dec 28, 2010 at 2:05

redShadow


1 Comment

Keith Over a year ago

Not quite... it's just the language bindings and doesn't have the DOM API. Most real-world JavaScript still won't work in it. By the time you add all the parts you need, you will have... a browser. The closest thing I know of is HtmlUnit.
