I am trying to get sub-elements with lxml.html; the code is below.

import lxml.html as LH

html = """
<ul class="news-list2">
            <li>
            <div class="txt-box">
            <p class="info">Number:<label>cewoilgas</label></p>
            </div>
            </li>

            <li>
            <div class="txt-box">
            <p class="info">Number:<label>NHYQZX</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>energyinfo</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>calgary_information</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>oilgas_pro</label>
            </p>
            </div>
            </li>

</ul>
"""

To get the sub-element inside each li:

htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath("//p/label/text()"))

I am curious why the output is:

['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']

I also found that the following works:

htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath(".//p/label/text()"))

The result is:

['cewoilgas']
['NHYQZX']
['energyinfo']
['calgary_information']
['oilgas_pro']

Should this be regarded as a bug in lxml? Why does the XPath expression still match against the whole tree (from the root ul) when it is called on a sub-element (li)?

2 Answers

No, this is not a bug; it is intended behavior. An expression that starts with // is absolute: it is evaluated from the root of the document, regardless of whether you call xpath() on the root or on any other element of the tree.

Just remember: if you call xpath() on an element and want the expression evaluated relative to that element, always start it with a dot, which refers to the current node.
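The difference can be checked side by side. This is a minimal sketch using a shortened version of the HTML from the question (two list items instead of five); the variable names are only illustrative.

```python
import lxml.html as LH

# Shortened version of the markup from the question (assumption: two items are enough).
html = """
<ul class="news-list2">
  <li><div class="txt-box"><p class="info">Number:<label>cewoilgas</label></p></div></li>
  <li><div class="txt-box"><p class="info">Number:<label>NHYQZX</label></p></div></li>
</ul>
"""

root = LH.fromstring(html)
items = root.xpath("//ul/li")

# Absolute path: evaluated from the document root, even though it is called on each li.
absolute = [li.xpath("//p/label/text()") for li in items]

# Relative path: the leading dot anchors the search at the current li.
relative = [li.xpath(".//p/label/text()") for li in items]

print(absolute)
print(relative)
```

With the absolute expression every li returns all labels in the document; with the leading dot each li returns only its own label.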

By the way, absolutely (pun intended) the same happens in selenium and its find_element(s)_by_xpath().

2 Comments

ok, if it is intended. Just accept this as it is... however, this is really confusing for newbies...
Actually I think the OP is right. This behaviour doesn't make any sense. If you call etree.tostring(li), you get an XML fragment as a string, while using li in a new xpath expression works on the original whole tree. This is very counterintuitive.

//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node

//olist/item selects all the item elements in the same document as the context node that have an olist parent

. selects the context node

.//para selects the para element descendants of the context node

You can find more examples in the XML Path Language (XPath) specification.
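The four expressions quoted above can be tried with lxml.etree. This is a rough sketch only: the small document and the choice of the first chapter as context node are made up for illustration and do not come from the specification.

```python
from lxml import etree

# Hypothetical document that uses the element names from the quoted examples.
doc = etree.fromstring(
    "<doc>"
    "<chapter><para>a</para><olist><item>1</item></olist></chapter>"
    "<chapter><para>b</para><para>c</para></chapter>"
    "</doc>"
)
first_chapter = doc[0]  # context node for every call below

# //para: all para elements in the same document as the context node.
all_paras = first_chapter.xpath("//para/text()")

# //olist/item: all item elements in the document that have an olist parent.
olist_items = first_chapter.xpath("//olist/item/text()")

# . : the context node itself.
context = first_chapter.xpath(".")

# .//para: only the para descendants of the context node.
own_paras = first_chapter.xpath(".//para/text()")

print(all_paras, olist_items, own_paras)
```

Only the last expression is limited to the first chapter; the two that start with // ignore which element xpath() was called on.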
