Finding html element with class using lxml

Question

I've searched everywhere, and the suggestion I found most often was doc.xpath('//element[@class="classname"]'), but it does not work no matter what I try.

code I'm using

import lxml.html
from urllib.request import urlopen

def check():
    data = urlopen('url').read()
    return str(data)

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)

It simply prints an empty list.

Edit: How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was actually using (YouTube).

Here's the exact code I'm using.

import lxml.html
from urllib.request import urlopen

def check():
    # TopGear channel page as a test
    data = urlopen('http://www.youtube.com/user/TopGear').read()
    return data.decode('utf-8', 'ignore')

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)

'url' is a 3-character string. It is not an HTML file.

– mzjn, Commented Nov 22, 2011 at 22:43
Obviously I did that instead of posting the real url.

– Uriah, Commented Nov 23, 2011 at 16:46
Please provide a SSCCE.

– mzjn, Commented Nov 23, 2011 at 18:46

mzjn · Accepted Answer · 2011-11-24 21:12:55Z

36

The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):

el = doc.xpath("//div[@class='channel-title-container']")

Or this:

el = doc.xpath("//div[@class='a yb xr']")

To find <div> elements with a class attribute that contains the string channel, you could use

el = doc.xpath("//div[contains(@class, 'channel')]")
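The difference between the exact comparison and contains() can be seen on a small document. This is a minimal sketch: the markup below is hypothetical and only stands in for the real page, which may look different.

```python
import lxml.html

# Hypothetical markup, not the real YouTube page.
html = """
<html><body>
  <div class="branded-page channel">two classes</div>
  <div class="channel-title-container">title</div>
  <div class="other">unrelated</div>
</body></html>
"""

doc = lxml.html.document_fromstring(html)

# Exact string comparison: the whole attribute value must equal 'channel'.
exact = doc.xpath("//div[@class='channel']")

# Substring comparison: any class attribute that contains 'channel' matches.
partial = doc.xpath("//div[contains(@class, 'channel')]")

print(exact)
print([d.get('class') for d in partial])
```

Note that contains() is a plain substring test, so it also picks up channel-title-container here.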

edited Nov 24, 2011 at 21:12

answered Nov 24, 2011 at 17:16


5 Comments

mzjn Over a year ago

branded-page channel is not the same as channel.

Uriah Over a year ago

But, according to css, that element has two classes, branded-page and channel. So why wouldn't it?

mzjn Over a year ago

Yes, according to CSS there are two classes. But XPath does not know about the rules of CSS. To XPath, branded-page channel is just a string with no special meaning.

Uriah Over a year ago

That's actually helpful, thanks. Just as a test, I tried to get an element on this page, and it's not working either. This is really starting to piss me off. el = doc.xpath('//a[@class="vote-accepted-off"]') It appears that it doesn't like to find elements that don't have child elements.

Efe Over a year ago

Just to complete your answer, we can also use not() for negation. Example: el = doc.xpath("//div[contains(@class, 'channel') and not(contains(@class, 'disabled'))]")
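A short sketch of the not() variant, run against hypothetical markup (the class names channel, disabled and sidebar are only illustrative):

```python
import lxml.html

# Hypothetical markup for illustration.
html = """
<html><body>
  <div class="channel">active</div>
  <div class="channel disabled">inactive</div>
  <div class="sidebar">unrelated</div>
</body></html>
"""

doc = lxml.html.document_fromstring(html)

# Keep divs whose class mentions 'channel' but does not mention 'disabled'.
el = doc.xpath(
    "//div[contains(@class, 'channel') and not(contains(@class, 'disabled'))]"
)
print([d.text for d in el])
```

Both tests are still substring tests, so the same caveat about partial class names applies.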

dmzkrsk · Accepted Answer · 2012-01-26 02:56:39Z

3

You can use lxml.cssselect to simplify lookups by class and id: http://lxml.de/dev/cssselect.html

answered Jan 26, 2012 at 2:56


Andrei.Danciuc · Accepted Answer · 2019-04-28 14:40:22Z

2

HTML uses classes (a lot), which makes them convenient hooks for XPath queries. However, XPath has no knowledge of or support for CSS classes (or even space-separated lists), which makes classes a pain in the ass to check. The canonically correct way to look for elements having a specific class is:

//*[contains(concat(' ', normalize-space(@class), ' '), '$className')]

In your case this is

el = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), 'channel')]"
)
# print(el)
# [<Element div at 0x7fa44e31ccc8>, <Element div at 0x7fa44e31c278>, <Element div at 0x7fa44e31cdb8>]

Or register your own XPath function, hasaclass(*classes):

from lxml import etree

def _hasaclass(context, *cls):
    return "your implementation ..."

xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass

el = doc.xpath("//div[hasaclass('channel')]")
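One possible way to fill in the function body, as a sketch rather than the answer author's implementation. It assumes that "has a class" means at least one of the given names appears as a whole token in the class attribute, and it uses hypothetical markup.

```python
import lxml.html
from lxml import etree

def _hasaclass(context, *cls):
    # context.context_node is the element the predicate is being evaluated on.
    tokens = (context.context_node.get('class') or '').split()
    # Assumption: a match means at least one requested name is present.
    return any(name in tokens for name in cls)

# Register the function in the default (prefix-less) XPath function namespace.
xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass

# Hypothetical markup for illustration.
html = """
<html><body>
  <div class="branded-page channel">a</div>
  <div class="somechannel">b</div>
  <div class="other">c</div>
</body></html>
"""

doc = lxml.html.document_fromstring(html)
el = doc.xpath("//div[hasaclass('channel')]")
print([d.get('class') for d in el])
```

Because the attribute is split on whitespace in Python, somechannel is not treated as a match for channel.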

answered Apr 28, 2019 at 14:40


1 Comment

Daniel Haley Over a year ago

The second arg in contains() should also have spaces added (like ' $className ' and ' channel '). Otherwise you'll still match classes like somechannel.
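The effect of padding the second argument can be checked with a small sketch. The markup is hypothetical, and the variable names strict and loose are only illustrative.

```python
import lxml.html

# Hypothetical markup for illustration.
html = """
<html><body>
  <div class="channel">a</div>
  <div class="branded-page  channel">b</div>
  <div class="somechannel">c</div>
  <div class="channel-title-container">d</div>
</body></html>
"""

doc = lxml.html.document_fromstring(html)

# Padded on both sides: 'channel' must appear as a whole class token.
strict = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)

# Unpadded second argument: behaves like a plain substring test.
loose = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), 'channel')]"
)

print([d.text for d in strict])
print([d.text for d in loose])
```

With the padded form only the first two divs match, while the unpadded form also matches somechannel and channel-title-container.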
