I am trying to get a sub-element with lxml.html. My code is below.
import lxml.html as LH
html = """
<ul class="news-list2">
<li>
<div class="txt-box">
<p class="info">Number:<label>cewoilgas</label></p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>NHYQZX</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>energyinfo</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>calgary_information</label>
</p>
</div>
</li>
<li>
<div class="txt-box">
<p class="info">Number:<label>oilgas_pro</label>
</p>
</div>
</li>
</ul>
"""
To get the sub-element inside each li:
htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    # XPath is evaluated on each li element in turn
    print(li.xpath("//p/label/text()"))
I am curious why the output is:
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
['cewoilgas', 'NHYQZX', 'energyinfo', 'calgary_information', 'oilgas_pro']
I also found that the following works:
htm = LH.fromstring(html)
for li in htm.xpath("//ul/li"):
    # Note the leading dot in the expression
    print(li.xpath(".//p/label/text()"))
The result is:
['cewoilgas']
['NHYQZX']
['energyinfo']
['calgary_information']
['oilgas_pro']
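For easier comparison, here is a reduced, self-contained reproduction of both behaviors. The two-item markup and the names `doc`, `absolute`, and `relative` are made up for illustration; the only difference between the two queries is the leading dot.

```python
import lxml.html as LH

# Hypothetical two-item list, reduced from the markup above
doc = LH.fromstring(
    "<ul><li><p><label>a</label></p></li>"
    "<li><p><label>b</label></p></li></ul>"
)

# Same query on each li, without and with the leading dot
absolute = [li.xpath("//p/label/text()") for li in doc.xpath("//ul/li")]
relative = [li.xpath(".//p/label/text()") for li in doc.xpath("//ul/li")]

print(absolute)  # every li returns all labels
print(relative)  # every li returns only its own label
```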
Should this be regarded as a bug in lxml? Why does the XPath expression still match across the whole tree (everything under the ul) when it is evaluated on a sub-element (li)?