I'm trying to use cssselect on some HTML page parsed by lxml, but I found that only one parser gives the expected result:

This works just fine:

lxml.html.fromstring("...").cssselect("div.foo")

This returns no results:

lxml.html.html5parser.fromstring("...").cssselect("div.foo")

What's the difference? And can I get cssselect to work with html5parser?

1 Answer

Please see these two answers about the reason:

How to remove namespace value from inside lxml.html.html5paser element tag

lxml html5parser ignores "namespaceHTMLElements=False" option

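In short, the html5lib parser puts every element in the XHTML namespace (so a div is stored as {http://www.w3.org/1999/xhtml}div), while the other parsers leave elements un-namespaced. The selector div.foo only matches the un-namespaced name, so it finds nothing.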
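
To make the namespace mismatch concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of lxml or html5lib. That substitution is an assumption made only so the example runs without extra packages, and the sample markup is hypothetical. It shows the same effect: a lookup by the bare name "div" misses an element stored under the XHTML namespace.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: the same markup with and without the XHTML namespace.
namespaced = ET.fromstring(
    '<html xmlns="%s"><body><div class="foo"/></body></html>' % XHTML_NS
)
plain = ET.fromstring('<html><body><div class="foo"/></body></html>')

# A bare-name lookup only matches elements whose tag is exactly "div".
plain_hits = plain.findall(".//div")
missed = namespaced.findall(".//div")

# The namespaced tree stores the tag as "{namespace}div", so the lookup
# has to name the namespace explicitly.
qualified_hits = namespaced.findall(".//{%s}div" % XHTML_NS)

print(len(plain_hits), len(missed), len(qualified_hits))
print(qualified_hits[0].tag)
```

cssselect translates div.foo into an XPath expression on the bare name, so against an html5lib tree it behaves like the lookup that comes back empty here.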

I think this may be a bug on the lxml side. As a workaround, build the html5lib parser yourself with namespaceHTMLElements=False and pass it to fromstring:

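import lxml.html.html5parser
from html5lib import HTMLParser
from html5lib.treebuilders.etree_lxml import TreeBuilder

# Use html5lib's own HTMLParser so namespaceHTMLElements=False takes effect.
parser = HTMLParser(tree=TreeBuilder, namespaceHTMLElements=False)
root = lxml.html.html5parser.fromstring('<div class="foo"></div>', parser=parser)
# Elements are now un-namespaced, so the selector can match them.
print(root.cssselect("div.foo"))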