extract specific element from nested elements using lxml html

Question

Hi all, I am having some problems that I think come down to XPath. I am using the html module from the lxml package to extract some data. Below is the most simplified version of my situation; keep in mind that the HTML I am actually working with is much uglier.

<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>

What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so:

from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex. Any thoughts?

Dimitre Novatchev · Accepted Answer · 2010-04-14 13:04:24Z

3

Use:

//td[. = 'Header1']/ancestor::table[1]
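A minimal sketch of this approach run through lxml against the markup from the question. One detail worth checking: in the sample, the td holding Header1 has no text node of its own (the text sits inside u and b), so a test on text() finds nothing there, while a test on the td's string value (.) does. The variable names are only illustrative.

```python
from lxml import html

page = """
<table>
  <tr><td>
    <table>
      <tr><td></td></tr>
      <tr><td>
        <table>
          <tr><td><u><b>Header1</b></u></td></tr>
          <tr><td>Data</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(page)

# text() only looks at text nodes that are direct children of the td.
by_text_node = tree.xpath("//td[text() = 'Header1']/ancestor::table[1]")

# "." compares the td's whole string value, including text inside <u><b>.
by_string_value = tree.xpath("//td[. = 'Header1']/ancestor::table[1]")

print(len(by_text_node), len(by_string_value))
```

ancestor::table[1] picks the nearest enclosing table, which is why the match on the td leads to the innermost table rather than the outer ones.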


2 Comments

Dimitre Novatchev Over a year ago

@DerrickPetzold: For good XPath/XSLT resources see this answer: stackoverflow.com/questions/339930/…

kynan Over a year ago

@DimitreNovatchev The answer you are referring to has been deleted: internet archive snapshot

Tomalak · Accepted Answer · 2010-04-14 08:54:11Z

Find the header you are interested in and then pull out its table.

//u[b = 'Header1']/ancestor::table[1]

or

//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]

Note that a path beginning with // always starts at the document root (!), even inside a predicate. You can't do:

//table[//*[contains(text(), "Header1")]]

and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:

//table[.//*[contains(text(), "Header1")]]

won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.

Also, don't test the condition on every node .//*, since it can't be true for every node to begin with. It's more efficient to be specific.
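To see the difference between these expressions, here is a small sketch that counts what each one returns for the markup in the question. It assumes lxml's html module as used in the question; the variable names are made up for illustration.

```python
from lxml import html

page = """
<table>
  <tr><td>
    <table>
      <tr><td></td></tr>
      <tr><td>
        <table>
          <tr><td><u><b>Header1</b></u></td></tr>
          <tr><td>Data</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(page)

# Inner // restarts at the document root, so the predicate is true for every table.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# .// is relative to each table, but every table still contains Header1 somewhere.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Excluding cells that contain nested tables leaves only the innermost match.
innermost = tree.xpath(
    '//td[not(.//table) and .//b = "Header1"]/ancestor::table[1]'
)

print(len(absolute), len(relative), len(innermost))
```

Only the last expression narrows the result to a single table, because it anchors on the specific cell first and then walks up.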

Michał Marczyk · Accepted Answer · 2010-04-14 05:48:14Z

0

Perhaps this would work for you:

tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")

The not(descendant::table) bit ensures that you're getting the innermost table.
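A short sketch of how the not(descendant::table) filter behaves on the question's markup, assuming lxml's html module. It also shows a detail that is easy to miss: ending the path with a /*[...] step returns the table's matching child, while putting the same test in a predicate returns the table itself. Variable names are illustrative only.

```python
from lxml import html

page = """
<table>
  <tr><td>
    <table>
      <tr><td></td></tr>
      <tr><td>
        <table>
          <tr><td><u><b>Header1</b></u></td></tr>
          <tr><td>Data</td></tr>
        </table>
      </td></tr>
    </table>
  </td></tr>
</table>
"""

tree = html.fromstring(page)

# A trailing /* step selects children of the innermost table, not the table.
children = tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

# Keeping the text test in a predicate selects the innermost table element.
tables = tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")

print([el.tag for el in children])
print([el.tag for el in tables])
```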


jfs · Accepted Answer · 2010-04-14 06:05:19Z

0

table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')

//*[.="Header1"] selects every element, anywhere in the document, whose string value is exactly Header1.
ancestor::table[1] selects the nearest ancestor of that element that is a table.

Complete example

#!/usr/bin/env python
from lxml import html

page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))


3 Comments

Tomalak Over a year ago

While this is correct for the example, I think it is too generic to use //*[.="Header1"]. There could be something like "see <i>Header1</i>" somewhere in the input and your expression would match the <i>.

jfs Over a year ago

@Tomalak: It always matches a <table> element. It doesn't matter what element contains "Header1" as long as it is somewhere inside the <table> element.

Tomalak Over a year ago

Right, no argument there. Still, my point is that you might not be matching the table header as such, but anything generic that by chance contains the text 'Header1'. So chances are you match the wrong table.
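The concern in this exchange can be illustrated with a small sketch. The second table and the id attributes below are invented for the demonstration and are not part of the question's markup; the point is only to show how a generic match can pull in an unrelated table.

```python
from lxml import html

# Hypothetical input: the wanted table plus another table that merely mentions Header1.
page = """
<div>
  <table id="wanted">
    <tr><td><u><b>Header1</b></u></td></tr>
    <tr><td>Data</td></tr>
  </table>
  <table id="other">
    <tr><td>see <i>Header1</i></td></tr>
  </table>
</div>
"""

tree = html.fromstring(page)

# Any element whose string value is Header1 counts, including the stray <i>.
generic = tree.xpath('//*[. = "Header1"]/ancestor::table[1]')

# Requiring the <u><b> header structure ignores the stray mention.
specific = tree.xpath('//u[b = "Header1"]/ancestor::table[1]')

generic_ids = [t.get("id") for t in generic]
specific_ids = [t.get("id") for t in specific]
print(generic_ids, specific_ids)
```

With input like this, the single-element unpacking `table, = ...` would raise a ValueError for the generic expression, since it returns more than one table.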
