get sub elements with xpath of lxml.html (Python)

Question

I am trying to get sub-elements with lxml.html. The code is below.

import lxml.html as LH

html = """
<ul class="news-list2">
            <li>
            <div class="txt-box">
            <p class="info">Number:<label>cewoilgas</label></p>
            </div>
            </li>

            <li>
            <div class="txt-box">
            <p class="info">Number:<label>NHYQZX</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>energyinfo</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>calgary_information</label>
            </p>
            </div>
            </li>

        <li>
            <div class="txt-box">
            <p class="info">Number:<label>oilgas_pro</label>
            </p>
            </div>
            </li>

</ul>
"""

To get the sub element in li:

htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath("//p/label/text()"))

I am curious why the output is:

['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']

I also found that the following works:

htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    print(li.xpath(".//p/label/text()"))

The result is:

['cewoilgas']
['NHYQZX']
['energyinfo']
['calgary_information']
['oilgas_pro']

Should this be regarded as a bug in lxml? Why does the XPath still match against the whole root element (ul) when it is called on a sub-element (li)?

alecxe · Accepted Answer · 2016-12-22 07:43:09Z

3

No, this is not a bug; it is intended behavior. If an expression starts with //, it does not matter whether you call it on the root of the tree or on any element within it: the expression is absolute and is evaluated from the document root.

Just remember: if you call xpath() on an element and want the expression to be evaluated relative to that element, always start the expression with a dot, which refers to the current node.
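As a minimal sketch of the difference (assuming lxml is installed, and using a trimmed two-item version of the markup from the question), the same element can be queried both ways. The variable names here are illustrative only:

```python
import lxml.html as LH

# Trimmed-down version of the markup from the question (two items only).
html = """
<ul class="news-list2">
  <li><div class="txt-box"><p class="info">Number:<label>cewoilgas</label></p></div></li>
  <li><div class="txt-box"><p class="info">Number:<label>NHYQZX</label></p></div></li>
</ul>
"""

root = LH.fromstring(html)
first_li = root.xpath("//ul/li")[0]

# Leading "//": evaluated from the document root, so every label matches,
# even though xpath() is called on a single <li>.
absolute = first_li.xpath("//p/label/text()")

# Leading ".//": evaluated from the context node, so only this <li>'s
# own label matches.
relative = first_li.xpath(".//p/label/text()")

print(absolute)  # ['cewoilgas', 'NHYQZX']
print(relative)  # ['cewoilgas']
```

The only change between the two calls is the leading dot; the element that xpath() is called on is the same.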

By the way, absolutely (pun intended) the same thing happens in Selenium and its find_element(s)_by_xpath() methods.


2 Comments

ZY Huang Over a year ago

OK, if it is intended, I will just accept it as it is. However, this is really confusing for newbies.

fotis j Over a year ago

Actually, I think the OP is right. This behaviour doesn't make any sense. If you call etree.tostring(li), you get an XML fragment as a string, yet using li in a new XPath expression works on the original whole tree. This is very counterintuitive.

宏杰李 · Accepted Answer · 2016-12-22 08:03:53Z

0

//para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node

//olist/item selects all the item elements in the same document as the context node that have an olist parent

. selects the context node

.//para selects the para element descendants of the context node
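The four expressions above can be tried out with lxml. This is only a sketch: the small document below is made up for illustration (it borrows the para, olist and item names from the examples), and the second chapter element is arbitrarily chosen as the context node.

```python
from lxml import etree

# Hypothetical document, invented for this example.
doc = etree.fromstring(
    "<doc>"
    "<chapter><para>a</para><olist><item>1</item><item>2</item></olist></chapter>"
    "<chapter><para>b</para><para>c</para></chapter>"
    "</doc>"
)
second = doc[1]  # second <chapter>, used as the context node

# //para: all para elements in the same document as the context node
all_paras = [p.text for p in second.xpath("//para")]

# //olist/item: all item elements in the document that have an olist parent
items = [i.text for i in second.xpath("//olist/item")]

# . : the context node itself
context = second.xpath(".")

# .//para: only the para descendants of the context node
own_paras = [p.text for p in second.xpath(".//para")]

print(all_paras)  # ['a', 'b', 'c']
print(items)      # ['1', '2']
print(own_paras)  # ['b', 'c']
```

Note that //olist/item returns items even though the second chapter contains no olist at all, which is the same effect the question ran into.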

You can find more examples in the XML Path Language (XPath) specification.

