17

I've searched everywhere, and what I mostly found was doc.xpath('//element[@class="classname"]'), but this does not work no matter what I try.

The code I'm using:

import lxml.html
from urllib.request import urlopen

def check():
    data = urlopen('url').read()
    return str(data)

doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='test']")
print(el)

It simply prints an empty list.

Edit: How odd. I used Google as a test page and it works fine there, but it doesn't work on the page I was actually using (YouTube).

Here's the exact code I'm using.

import lxml.html
from urllib.request import urlopen
import sys

def check():
    data = urlopen('http://www.youtube.com/user/TopGear').read()  # TopGear as a test
    return data.decode('utf-8', 'ignore')


doc = lxml.html.document_fromstring(check())
el = doc.xpath("//div[@class='channel']")
print(el)
3
  • 'url' is a 3-character string. It is not an HTML file. Commented Nov 22, 2011 at 22:43
  • Obviously I did that instead of posting the real URL. Commented Nov 23, 2011 at 16:46
  • Please provide an SSCCE. Commented Nov 23, 2011 at 18:46

3 Answers

36

The TopGear page that you use for testing doesn't have any <div class="channel"> elements. But this works (for example):

el = doc.xpath("//div[@class='channel-title-container']")

Or this:

el = doc.xpath("//div[@class='a yb xr']")

To find <div> elements with a class attribute that contains the string channel, you could use

el = doc.xpath("//div[contains(@class, 'channel')]") 
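To make the difference between the exact match and contains() concrete, here is a minimal sketch. The HTML fragment is made up for illustration and is not the real YouTube markup; the helper name texts is likewise just an assumption for this example.

```python
import lxml.html

# Made-up markup for illustration only; not the real YouTube page.
html = """
<html><body>
  <div class="channel">a</div>
  <div class="branded-page channel">b</div>
  <div class="channel-title-container">c</div>
  <div class="other">d</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

def texts(expr):
    """Return the text of every element matched by the XPath expression."""
    return [el.text for el in doc.xpath(expr)]

# @class='...' compares the whole attribute value as one string.
exact = texts("//div[@class='channel']")

# contains() is a plain substring test on the attribute value.
substring = texts("//div[contains(@class, 'channel')]")

print(exact)      # ['a']
print(substring)  # ['a', 'b', 'c']
```

Note that the substring test also picks up channel-title-container, which may or may not be what you want.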

5 Comments

branded-page channel is not the same as channel.
But according to CSS, that element has two classes, branded-page and channel. So why wouldn't it?
Yes, according to CSS there are two classes. But XPath does not know about the rules of CSS. To XPath, branded-page channel is just a string with no special meaning.
That's actually helpful, thanks. Just as a test, I tried to get an element on this page with el = doc.xpath('//a[@class="vote-accepted-off"]'), and it's not working either. This is really starting to piss me off. It appears that it doesn't like to find elements that don't have child elements.
Just to complete your answer, we can also use not() for negation. Example: el = doc.xpath("//div[contains(@class, 'channel') and not(contains(@class, 'disabled'))]")
3

You can use lxml.cssselect to simplify class and id queries: http://lxml.de/dev/cssselect.html

Comments

2

HTML uses classes (a lot), which makes them convenient hooks for XPath queries. However, XPath has no knowledge of CSS classes (or even of space-separated lists), which makes classes a pain in the ass to check. The canonically correct way to look for elements having a specific class is to pad both the attribute value and the class name with spaces, so that only whole tokens match:

//*[contains(concat(' ', normalize-space(@class), ' '), ' $className ')]

In your case this is

el = doc.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"
)
# print(el)
# [<Element div at 0x7fa44e31ccc8>, <Element div at 0x7fa44e31c278>, <Element div at 0x7fa44e31cdb8>]

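or register your own XPath function, hasaclass(*classes):

from lxml import etree

def _hasaclass(context, *cls):
    # One possible implementation: true if the current node carries
    # any of the given class names as a whole token.
    node_classes = context.context_node.get('class', '').split()
    return any(c in node_classes for c in cls)

# Register the function in the default (prefix-less) XPath namespace.
xpath_utils = etree.FunctionNamespace(None)
xpath_utils['hasaclass'] = _hasaclass

el = doc.xpath("//div[hasaclass('channel')]")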

1 Comment

The second arg in contains() should also have spaces added (like ' $className ' and ' channel '). Otherwise you'll still match classes like somechannel.
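A small sketch of the difference the padding makes, run against a made-up fragment (not the real page). The variable names loose and strict are just labels for this example.

```python
import lxml.html

# Made-up markup for illustration only.
html = """
<html><body>
  <div class="channel">a</div>
  <div class="branded-page channel">b</div>
  <div class="somechannel">c</div>
  <div class="channel-title-container">d</div>
</body></html>
"""
doc = lxml.html.document_fromstring(html)

# Second argument without surrounding spaces: still a substring match.
loose = "//div[contains(concat(' ', normalize-space(@class), ' '), 'channel')]"

# Second argument padded with spaces: matches only the whole token.
strict = "//div[contains(concat(' ', normalize-space(@class), ' '), ' channel ')]"

loose_hits = [el.text for el in doc.xpath(loose)]
strict_hits = [el.text for el in doc.xpath(strict)]

print(loose_hits)   # ['a', 'b', 'c', 'd']
print(strict_hits)  # ['a', 'b']
```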
