<html>
<head>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
</head>
<body>
    <script type="text/javascript">
    document.write('<a href="http://www.google.com">f*** js</a>');
    document.write("f*** js!");
    </script>
<div><a href="http://www.google.com">f*** js</a></div>
</body>
</html>

I want to use XPath to select all the <a> tags in the HTML page above:

In [1]: import lxml.html as H

In [2]: f = open("test.html","r")

In [3]: c = f.read()

In [4]: doc = H.document_fromstring(c)

In [5]: doc.xpath('//a')
Out[5]: [<Element a at a01d17c>]

In [6]: a = doc.xpath('//a')[0]

In [7]: a.getparent()
Out[7]: <Element div at a01d41c>

I only get the one that is not generated by JavaScript, but the Firefox XPath Checker add-on finds all of the <a> tags:

https://i.sstatic.net/0hSug.png

How can I do the same from Python? Thanks! Here is a second example page:

<html>
<head>
</head>
<body>
<script language="javascript">
function over(){
a.innerHTML="mouse me"
}
function out(){
a.innerHTML="<a href='http://www.google.com'>google</a>"
}
</script>
<li id="a" onmouseover="over()" onmouseout="out()">mouse me</li>
</body>
</html>
  • removed profanity as it served nothing.. Commented Dec 28, 2010 at 1:05
  • You will have to parse and interpret the js before parsing the HTML. Have you seen crummy.com/software/BeautifulSoup? Commented Dec 28, 2010 at 1:10
  • Your javascript, as given, makes no sense - you are writing links into the document's head? From lxml's point of view, anything in document.write is a string constant, not to be parsed. Commented Dec 28, 2010 at 1:24
  • BTW, document.write() is not allowed in XML documents. You must use the DOM API. Commented Dec 28, 2010 at 1:26
  • I think the title should be "javascript-aware html parser for Python".. Commented Dec 28, 2010 at 1:45
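
To illustrate the comment above about document.write arguments being plain string constants to a static parser, here is a minimal sketch using only the Python standard library (html.parser and re) rather than lxml. The names LinkCollector, WRITE_CALL, find_links and SAMPLE are made up for this example, and the sample page is a simplified stand-in for the one in the question. The regular expression only handles a single quoted string literal passed directly to document.write; it does not execute any JavaScript, so anything built with variables, concatenation, innerHTML or event handlers is out of its reach.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags and keep the raw text found inside <script>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href"))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # A script body reaches the parser as text; markup inside it is not parsed.
        if self._in_script:
            self.script_chunks.append(data)


# Assumption: the argument is one quoted literal with no escaped quotes inside.
WRITE_CALL = re.compile(r"""document\.write\(\s*(['"])(.*?)\1\s*\)""", re.S)


def find_links(html):
    """Return (links in the static markup, links inside document.write literals)."""
    static = LinkCollector()
    static.feed(html)
    static.close()

    written = LinkCollector()
    for _quote, fragment in WRITE_CALL.findall("".join(static.script_chunks)):
        written.feed(fragment)  # parse the string literal as an HTML fragment
    written.close()
    return static.hrefs, written.hrefs


# Hypothetical sample page, modeled loosely on the question's first example.
SAMPLE = """<html>
<head>
<script type="text/javascript">
document.write('<a href="http://www.google.com">from head</a>');
document.write("plain text");
</script>
</head>
<body>
<script type="text/javascript">
document.write('<a href="http://www.google.com">from body</a>');
</script>
<div><a href="http://www.google.com">static</a></div>
</body>
</html>"""

static_links, written_links = find_links(SAMPLE)
print("static:", static_links)
print("from document.write literals:", written_links)
```

This only recovers links that appear verbatim in the page source. For the second example in the question, where the link is created by an event handler, something that actually runs the script against a DOM would still be needed.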

3 Answers


I don't know of a JavaScript-aware parser in Python, but you can use ANTLR to do the job. The idea is not mine, so I'm leaving you the link.

It's actually quite cool, because you can tune your parser to choose selectively which instructions need to be parsed (and executed).


1 Comment

Nice! And from the same question you linked, pypi.python.org/pypi/python-spidermonkey seems worth considering as well.

In Java there is Cobra. I don't know of any JavaScript-aware HTML parser for Python.


Searching google for "javascript standalone runtime", I found jslibs: a "standalone JavaScript development runtime environment for using JavaScript as a general-purpose scripting language", based on "SpiderMonkey library that is Gecko's JavaScript engine".

Sounds great! I haven't tested it yet, but it seems like it would let you run the JavaScript code you find in the page. I don't know how tricky that will be, though.

1 Comment

Not quite... it's just the language bindings, and it doesn't have the DOM API. Most real-world JavaScript still won't work in it. By the time you add all the parts you need, you will have... a browser. The closest thing I know of is HtmlUnit.
