4

Hi all, I am having some problems that I think come down to my XPath. I am using the html module from the lxml package to extract some data. Below is the most simplified version of the situation, but keep in mind that the HTML I am actually working with is much uglier.

<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>

What I really want is the deeply nested table, because it has the header text "Header1". I am trying like so:

from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))

but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex. Any thoughts?

4 Answers

3

Use:

//td[. = 'Header1']/ancestor::table[1]
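
One detail matters with the sample markup: the header text sits inside <u><b>, not directly in the <td>. The sketch below is only an illustration on the question's HTML (the variable names are made up for the example). It compares a test on the cell's own text nodes with a test on its string value.

```python
from lxml import html

# The nested tables from the question.
PAGE = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# text() only sees the td's own text nodes; here the text lives in <u><b>.
by_text_node = tree.xpath("//td[text() = 'Header1']/ancestor::table[1]")

# "." compares the td's whole string value, descendant text included.
by_string_value = tree.xpath("//td[. = 'Header1']/ancestor::table[1]")

print(len(by_text_node), len(by_string_value))  # 0 1
```

On this input only the string-value form finds the cell, and the table it returns has no further table nested inside it.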

2 Comments

@DerrickPetzold: For good XPath/XSLT resources see this answer: stackoverflow.com/questions/339930/…
@DimitreNovatchev The answer you are referring to has been deleted: Internet Archive snapshot
2

Find the header you are interested in and then pull out its table.

//u[b = 'Header1']/ancestor::table[1]

or

//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]

Note that a path beginning with // always starts at the document root (!), even when it appears inside a predicate. You can't do:

//table[//*[contains(text(), "Header1")]]

and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:

//table[.//*[contains(text(), "Header1")]]

won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.

Also, don't test the condition on every node with .//*, since only a few nodes can ever satisfy it to begin with. It's more efficient to name the element you expect.
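
The three behaviours described in this answer can be checked on the question's markup. This is a minimal sketch, assuming the sample HTML from the question; the dictionary labels are just names chosen for the illustration.

```python
from lxml import html

# The nested tables from the question.
PAGE = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

queries = {
    # Inner // restarts at the document root, so the predicate is true everywhere.
    "inner //": '//table[//*[contains(text(), "Header1")]]',
    # Inner .// is relative, but every enclosing table still contains the text.
    "inner .//": '//table[.//*[contains(text(), "Header1")]]',
    # Find the leaf cell first, then step up to its nearest table.
    "leaf td": "//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]",
}

counts = {label: len(tree.xpath(query)) for label, query in queries.items()}
for label, count in counts.items():
    print(label, count)
# inner // 3
# inner .// 3
# leaf td 1
```

Only the last form narrows the result to a single table, because the not(.//table) guard excludes the outer cells that merely wrap the inner table.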

0

Perhaps this would work for you:

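tree.xpath("//table[not(descendant::table) and contains(., 'Header1')]")

The not(descendant::table) bit ensures that you're getting the innermost table. The contains() test sits in the table's own predicate so that the table itself is returned, rather than one of its rows.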
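
Here is a small sketch of how the not(descendant::table) guard behaves on the question's markup. It also shows the difference between testing the text in a trailing step and testing it in a predicate on the table. It assumes the sample HTML from the question, and the variable names are only illustrative.

```python
from lxml import html

# The nested tables from the question.
PAGE = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr>
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(PAGE)

# Tables that have no other table nested inside them.
innermost = tree.xpath("//table[not(descendant::table)]")

# A trailing step selects children of that table, not the table itself.
children = tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")

# Keeping the text test in a predicate leaves the table as the result.
tables = tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")

print(len(innermost), [el.tag for el in children], [el.tag for el in tables])
```

If the table element is what you need, the predicate form avoids an extra step back up from the matched child.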

0
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
  • //*[.="Header1"] selects every element anywhere in the document whose string value is exactly Header1.
  • ancestor::table[1] selects the nearest ancestor of that element that is a table.

Complete example

#!/usr/bin/env python
from lxml import html

page = """
<table>
    <tr>
    <td>
        <table>
            <tr><td></td></tr>
            <tr><td>
                <table>
                    <tr><td><u><b>Header1</b></u></td></tr> 
                    <tr><td>Data</td></tr>
                </table>
            </td></tr>
        </table>
     </td></tr>
</table>
"""

tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))

3 Comments

While this is correct for the example, I think //*[.="Header1"] is too generic. There could be a "see <i>Header1</i>" somewhere in the input, and your expression would match the <i>.
@Tomalak: It always matches <table> element. It doesn't matter what element contains "Header1" as long as it is somewhere inside the <table> element.
Right, no argument there. Still, my point is that you might not be matching the table header as such, but anything generic that by chance contains the text 'Header1'. So chances are you match the wrong table.
