Selecting siblings from HTML with the Python lxml (lxml.html) library

Question

I process HTML in Python with the lxml library. I am trying to parse this website; my objective is to extract all the games that took place in the regular season (not the ones in the play-offs or pre-season). The problem I have encountered:

I can already select all elements that have the class nob-border:

subpage.cssselect(".nob-border")

lxml provides the cssselect method, which selects HTML elements using CSS selectors. What I would like to do next is select every element up to the next tr element that has the class nob-border. The HTML looks like this:

<tr class="center nob-border">
<tr class="table-dummyrow">
<tr class="odd deactivate" xeid="IqLK6ZNT">
<tr class=" deactivate" xeid="l0Xo8yvB">
<tr class="odd deactivate" xeid="QLnrBc9b">
<tr class=" deactivate" xeid="8pxmAHO4">
<tr class="odd deactivate" xeid="nVmvCwfh">
<tr class=" deactivate" xeid="v1lEBJvn">
<tr class="center nob-border">

There are rows with the class nob-border and a bunch of rows between them. I need to select the rows in between. More precisely, I don't want to select all the in-between rows at once: for every row with the nob-border class, I want the rows that are below it and above the next row with the nob-border class. I hope I was clear enough; if not, do not hesitate to ask questions.

paul trmbrth · Accepted Answer · 2014-02-21 14:03:45Z

It's not that elegant but I can propose this:

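for tr in subpage.cssselect('tr.nob-border'):
    # 1-based position of this row among its tr siblings
    current_pos = tr.xpath('count(./preceding-sibling::tr)+1')
    # position of the next "nob-border" row (named next_pos to avoid shadowing the built-in next)
    next_pos = tr.xpath('count(./following-sibling::tr[contains(@class, "nob-border")][1]/preceding-sibling::tr)+1')
    # rows strictly between the two; empty after the last "nob-border" row, which has no successor
    tr_in_between = tr.xpath('./following-sibling::tr[position() < $offset]', offset=next_pos - current_pos)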

For each table row tr with the "nob-border" class:

1. determine the current row's position in the sequence of tr siblings
2. determine the position of the next tr row with the "nob-border" class
3. select all tr siblings positioned between those two rows
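The three steps above can be tried out on a small, self-contained fragment. This is only a sketch: the SAMPLE markup, the helper name rows_until_next_header and the IS_HEADER predicate are made up for illustration, and the header rows are found with XPath rather than cssselect so that the example needs nothing beyond lxml itself. Unlike the loop above, it also assumes that the rows after the last "nob-border" row should be returned rather than dropped.

```python
import lxml.html

# Hypothetical sample markup modelled on the rows shown in the question.
SAMPLE = """
<table>
<tr class="center nob-border"><td>header A</td></tr>
<tr class="table-dummyrow"><td></td></tr>
<tr class="odd deactivate" xeid="IqLK6ZNT"><td>game 1</td></tr>
<tr class=" deactivate" xeid="l0Xo8yvB"><td>game 2</td></tr>
<tr class="center nob-border"><td>header B</td></tr>
<tr class="odd deactivate" xeid="QLnrBc9b"><td>game 3</td></tr>
</table>
"""

# XPath test for "the class attribute contains the token nob-border".
IS_HEADER = 'contains(concat(" ", normalize-space(@class), " "), " nob-border ")'


def rows_until_next_header(header):
    """Return the tr siblings between `header` and the next nob-border row."""
    next_headers = header.xpath('./following-sibling::tr[%s][1]' % IS_HEADER)
    if not next_headers:
        # Assumption: for the last section, keep every remaining sibling row.
        return header.xpath('./following-sibling::tr')
    # Step 1: position of the current row among its tr siblings.
    current_pos = header.xpath('count(./preceding-sibling::tr) + 1')
    # Step 2: position of the next nob-border row.
    next_pos = next_headers[0].xpath('count(./preceding-sibling::tr) + 1')
    # Step 3: following siblings that come before the next header.
    return header.xpath('./following-sibling::tr[position() < $offset]',
                        offset=next_pos - current_pos)


table = lxml.html.fromstring(SAMPLE)
sections = [
    [row.get("xeid") for row in rows_until_next_header(header)]
    for header in table.xpath('.//tr[%s]' % IS_HEADER)
]
print(sections)
```

The dummy row has no xeid attribute, so it shows up as None in the first section; it can be filtered out afterwards if only game rows are wanted.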

Here's an alternative solution using the "sets" EXSLT extensions:

for tr in subpage.cssselect('tr.nob-border'):
    # non-"nob-border" rows after tr, minus every row that follows a later "nob-border" row
    tr_in_between = tr.xpath(
        """ set:difference(following-sibling::tr[not(contains(@class, "nob-border"))],
                           following-sibling::tr[contains(@class, "nob-border")]
                                              /following-sibling::tr)""",
        namespaces={"set": "http://exslt.org/sets"})
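As a rough, runnable illustration of the set-difference idea, the same expression can be compiled once with etree.XPath and reused for every header row. The SAMPLE markup and the names between and groups are assumptions made for this sketch, not part of the original page.

```python
import lxml.html
from lxml import etree

# Hypothetical sample markup modelled on the rows shown in the question.
SAMPLE = """
<table>
<tr class="center nob-border"><td>header A</td></tr>
<tr class="table-dummyrow"><td></td></tr>
<tr class="odd deactivate" xeid="IqLK6ZNT"><td>game 1</td></tr>
<tr class=" deactivate" xeid="l0Xo8yvB"><td>game 2</td></tr>
<tr class="center nob-border"><td>header B</td></tr>
<tr class="odd deactivate" xeid="QLnrBc9b"><td>game 3</td></tr>
</table>
"""

# Compile the EXSLT expression once; it is evaluated relative to a header row.
between = etree.XPath(
    """set:difference(
           following-sibling::tr[not(contains(@class, "nob-border"))],
           following-sibling::tr[contains(@class, "nob-border")]
               /following-sibling::tr)""",
    namespaces={"set": "http://exslt.org/sets"},
)

table = lxml.html.fromstring(SAMPLE)
groups = {}
for header in table.xpath('.//tr[contains(@class, "nob-border")]'):
    # Map each header's text to the text of the rows that belong to it.
    groups[header.text_content().strip()] = [
        row.text_content().strip() for row in between(header)
    ]
print(groups)
```

With this expression the last header keeps its trailing rows, because the second node set is empty when no later "nob-border" row exists.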

Guy Gavriely · Accepted Answer · 2014-02-22 02:06:53Z

This approach leans more on Python and does less of the work in cssselect:

>>> trs = subpage.cssselect('tr')
>>> for prev, curr, nxt in zip(trs, trs[1:], trs[2:]):
...     if curr.cssselect('.nob-border'):
...         print(prev, curr, nxt)
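The loop above only prints the neighbours of each "nob-border" row. If the goal is to collect every row under its header, one possible Python-side variant is a single pass that starts a new group whenever a header row appears. This is a sketch under assumptions: the SAMPLE markup and the function name group_by_header are hypothetical, and rows that appear before the first header are simply skipped.

```python
import lxml.html

# Hypothetical sample markup modelled on the rows shown in the question.
SAMPLE = """
<table>
<tr class="center nob-border"><td>header A</td></tr>
<tr class="table-dummyrow"><td></td></tr>
<tr class="odd deactivate" xeid="IqLK6ZNT"><td>game 1</td></tr>
<tr class=" deactivate" xeid="l0Xo8yvB"><td>game 2</td></tr>
<tr class="center nob-border"><td>header B</td></tr>
<tr class="odd deactivate" xeid="QLnrBc9b"><td>game 3</td></tr>
</table>
"""


def group_by_header(rows):
    """Split rows into (header, [rows below it]) pairs in a single pass."""
    groups = []
    for row in rows:
        classes = (row.get("class") or "").split()
        if "nob-border" in classes:
            # A header row opens a new, initially empty group.
            groups.append((row, []))
        elif groups:
            # Any other row belongs to the most recent header.
            groups[-1][1].append(row)
    return groups


table = lxml.html.fromstring(SAMPLE)
groups = group_by_header(table.iter("tr"))
summary = [
    (header.text_content().strip(), [row.get("xeid") for row in body])
    for header, body in groups
]
print(summary)
```

Because it walks the rows in document order, this version assumes all tr elements of interest live in the same table; nested tables would need a narrower row selection.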
