Python: CSS Selector to use inside lxml.cssselect

Question

I am trying to parse the HTML below using lxml.html, with CSS selectors instead of XPath.

link = doc.cssselect('html body div.results dl dt a')

The code above gives me content-1 and content-2 as output, but my desired output is link 1 and link 2. So I replaced my code with

link = doc.cssselect('html body div.results dl dt a[href]')

but I still get the same output. So my question is: what is the proper CSS selector to get the href attribute?

<div class="results">
    <div> some tags here </div>
    <dl>
        <dt title="My Title 1" style="background: transparent url('/img/accept.png') no-repeat right center">
            <a href="/link 1"> content-1</a>
        </dt>
    </dl>

    <dl>
        <dt title="My Title 2" style="background: transparent url('/img/accept.png') no-repeat right center">
            <a href="/link 2">content-2</a>
        </dt>
    </dl>
</div>

brandizzi · Accepted Answer · 2019-05-07 20:31:40Z

10

I believe you cannot get the attribute value through CSS selectors. You should get the elements...

>>> elements = doc.cssselect('div.results dl dt a')

...and then get the attributes from them:

>>> for element in elements:
...     print(element.get('href'))
... 
/link 1
/link 2

Of course, list comprehensions are your friends:

>>> [element.get('href') for element in elements]
['/link 1', '/link 2']

Since CSS cannot change attributes, I believe it makes no sense to retrieve them through CSS selectors. You can mention attributes in a selector, but only to filter which elements match. ~~However, this is just conjecture and I may be wrong; if I am, someone please correct me :)~~ Well, @Tim Diggins confirms my hypothesis in the comments below :)

edited May 7, 2019 at 20:31

answered Dec 28, 2011 at 13:53

brandizzi


5 Comments

Tim Diggins Over a year ago

@brandizzi, you're right: you can only select elements in CSS, not attributes. The brackets are for filtering which elements to select (but it is not a bad idea to select only <a> tags with href attributes, which is what a[href] does).

Tim Diggins Over a year ago

@RanRag, you should tick brandizzi's answer as correct even if you didn't need it in the end.

RanRag Over a year ago

I was going to tick it, but you can only accept an answer after a certain period of time (I believe it's around 5 minutes).

Singletoned Over a year ago

For anyone coming to this later, you CAN get the attribute value through CSS selectors (using a pseudo-element). doc.cssselect('div.results dl dt a::attr(href)')

brandizzi Over a year ago

@Singletoned oooh, that's cool! Yet I could not get it to work with lxml and cssselect. When I tried it with lxml 4.9.2 (the most recent I've found) I got the error cssselect.xpath.ExpressionError: Pseudo-elements are not supported. Am I missing something? Is this supported on an unreleased version of lxml? Thanks!

Tim Diggins · Accepted Answer · 2011-12-28 13:55:59Z

4

You need to read the attribute from the result of cssselect (it always returns elements, never attributes).

First, I'm not sure about doc.cssselect (but maybe this is your own function?).

Normally lxml.cssselect is used:

from lxml.cssselect import CSSSelector
sel = CSSSelector('html body div.results dl dt a[href]')

then, assuming you've already got a doc

links = []
for a_href in sel(doc):
    links.append(a_href.get('href'))

or the more succinct:

links = [a_href.get('href') for a_href in doc.cssselect('html body div.results dl dt a[href]')]

answered Dec 28, 2011 at 13:55

Tim Diggins


1 Comment

RanRag Over a year ago

basically doc is equivalent to doc=lxml.html.fromstring(content) where content is my html data from urllib and read functions

Drarok · Accepted Answer · 2014-01-20 10:55:46Z

4

I have successfully used

#element-id ::attr(value)

to get the "value" attribute of HTML elements.

answered Jan 20, 2014 at 10:55

Drarok


Forrest Hu · Accepted Answer · 2015-04-17 20:52:17Z

0

lxml's CSSSelector can filter elements by attribute. The code below selects the script elements that have a src attribute and reads its value from each one.

   from lxml import cssselect

   # dochtml is an already parsed document, e.g. from lxml.html.fromstring()
   select = cssselect.CSSSelector("script[src]")
   links = [el.get('src') for el in select(dochtml)]
   for n, l in enumerate(links):
       print(n, l)

answered Apr 17, 2015 at 20:52

Forrest Hu
