
I have an element in a page that looks like this:

<a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a>

If you do document.cssselect("#cid-694094:Comment:188384") you will get:

lxml.cssselect.ExpressionError: The psuedo-class Symbol(u'Comment', 12) is unknown

The solution for that is handled in this question (the person was using Java).

However, when I try that in Python as such:

document.cssselect(r"#cid-694094\:Comment\:188384")

I get:

lxml.cssselect.SelectorSyntaxError: Bad symbol 'cid-694094\': 'unicodeescape' codec can't decode byte 0x5c in position 10: \ at end of string at [Token(u'#', 0)] -> None

The reason for that and a proposed solution can be found in this question. If I understand it correctly I should be doing:

document.cssselect(r"#cid-694094\\:Comment\\:188384")

But this still doesn't work. Instead I once again get:

lxml.cssselect.ExpressionError: The psuedo-class Symbol(u'Comment\', 14) is unknown

Can anybody tell me what I'm doing wrong?

Try it yourself using:

import lxml.html
document = lxml.html.fromstring(
    '<a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a>'
)
document.cssselect(r"#cid-694094\:Comment\:188384")
  • That's odd, I swear Stack Overflow is collapsing the backslashes in the last exception. Commented Dec 13, 2011 at 12:16

2 Answers


Isn't : not allowed in css for id or class?

Here is a work-around:

document.xpath('//a[@id="cid-694094:Comment:188384"]')
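A minimal sketch of the same attribute-match idea, for cases where the id is held in a variable. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made only so the snippet runs without third-party packages); the [@id='...'] predicate is written the same way in both. The wrapping div and the name target_id are illustrative only.

```python
import xml.etree.ElementTree as ET

# Wrap the anchor in a root element so there is something to search within.
root = ET.fromstring(
    '<div><a id="cid-694094:Comment:188384" name="694094:Comment:188384"></a></div>'
)

target_id = "cid-694094:Comment:188384"

# The colon needs no escaping inside an XPath string literal.
# Note: this simple formatting assumes target_id contains no single quote.
matches = root.findall(".//a[@id='%s']" % target_id)
```

With lxml, the same expression string should be usable with document.xpath(...) as in the work-around above.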

5 Comments

I'm not sure whether it is allowed or not. But the question I linked to earlier says you can escape the colon in your selector and it should work. You propose a pretty good workaround, but I assume it would be slower than an actual CSS selector by ID, because it will have to check all A elements, right? Maybe I can use getElementById...
Ah, with that link, now I'm pretty sure that it is not allowed. The HTML isn't under my control though, I just scrape it, so I'll just have to work around it.
Actually, the CSS selector is converted to XPath by lxml.cssselect.css_to_xpath().
Very enlightening! It turns out your work-around is almost exactly what cssselect("#id") would do. Thanks.

: is normally not allowed in ID selectors, and this is indeed the correct way to escape it:

document.cssselect(r"#cid-694094\:Comment\:188384")

However, the selector parser in lxml was badly broken until recently: it did not really implement backslash escapes. I fixed this in cssselect 0.7, which is now an independent project extracted from lxml.

http://packages.python.org/cssselect/

The "new" way to use it is a bit more verbose:

import cssselect
document.xpath(cssselect.HTMLTranslator().css_to_xpath(r'#cid-694094\:Comment\:188384'))

lxml 2.4 (not released yet) will use the new cssselect so the simpler syntax will work too.
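When ids come from scraped HTML, the backslash escaping shown above can be applied programmatically. The helper below is a simplified sketch, not part of cssselect or lxml; the name escape_css_identifier is hypothetical. It assumes the identifier does not start with a digit and contains no whitespace or control characters, which would need hexadecimal escapes instead.

```python
import re


def escape_css_identifier(identifier):
    """Backslash-escape characters that cannot appear bare in a CSS identifier.

    Simplified: letters, digits, '_' and '-' are left alone; every other
    character (such as ':') gets a backslash in front of it.
    """
    return re.sub(r"([^\w-])", r"\\\1", identifier)


# Build an ID selector for the element from the question.
selector = "#" + escape_css_identifier("cid-694094:Comment:188384")
```

Here selector ends up identical to the raw string r"#cid-694094\:Comment\:188384", so it can be passed to css_to_xpath() as above.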
