1

I process HTML in Python with the lxml library. I am trying to parse this website; my objective is to extract all the games played in the regular season (not the play-off or pre-season ones). The problem I have encountered:

I can select all elements that have the class nob-border:

subpage.cssselect(".nob-border")

lxml provides the cssselect method, which selects HTML elements with CSS selectors. What I would like to do next is select every element up to the next tr element that has the class nob-border. The HTML looks like this:

<tr class="center nob-border">
<tr class="table-dummyrow">
<tr class="odd deactivate" xeid="IqLK6ZNT">
<tr class=" deactivate" xeid="l0Xo8yvB">
<tr class="odd deactivate" xeid="QLnrBc9b">
<tr class=" deactivate" xeid="8pxmAHO4">
<tr class="odd deactivate" xeid="nVmvCwfh">
<tr class=" deactivate" xeid="v1lEBJvn">
<tr class="center nob-border"> 

There are rows with the class nob-border and a bunch of rows between them, and I need to select the ones in between. More precisely, I don't want all the in-between rows as one flat list: for every row with the nob-border class, I want the rows that are below it and above the next row with the nob-border class. I hope I was clear enough; if not, do not hesitate to ask questions.

2 Answers

1

It's not that elegant but I can propose this:

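for tr in subpage.cssselect('tr.nob-border'):
    # 1-based position of this header row among its tr siblings
    current_pos = tr.xpath('count(./preceding-sibling::tr)+1')
    # 1-based position of the next header row (evaluates to 1 if there is none,
    # so the last header row ends up with an empty selection)
    next_pos = tr.xpath('count(./following-sibling::tr[contains(@class, "nob-border")][1]/preceding-sibling::tr)+1')
    tr_in_between = tr.xpath('./following-sibling::tr[position() < $offset]', offset=next_pos - current_pos)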

For each table row tr with "nob-border" class,

  • determine the current row position in the tr siblings sequence
  • determine the position of the next tr row with "nob-border" class
  • select all tr siblings with a position in between those 2 positions
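To make the position arithmetic concrete, here is a minimal, self-contained sketch of the three steps above. The sample table, the section titles and the helper name rows_between_headers are made up for illustration; only the class names and xeid values come from the question. It uses plain XPath instead of cssselect so that it runs with lxml alone, and it adds one assumption of my own: when there is no later nob-border row, everything that follows is treated as belonging to the last header.

```python
from lxml import html

# Hypothetical sample modelled on the rows shown in the question.
SAMPLE = """
<table>
  <tr class="center nob-border"><td>Regular season</td></tr>
  <tr class="table-dummyrow"><td></td></tr>
  <tr class="odd deactivate" xeid="IqLK6ZNT"><td></td></tr>
  <tr class=" deactivate" xeid="l0Xo8yvB"><td></td></tr>
  <tr class="center nob-border"><td>Play-offs</td></tr>
  <tr class="odd deactivate" xeid="QLnrBc9b"><td></td></tr>
</table>
"""

# Whole-token class test, so "nob-border" does not match e.g. "nob-border-x".
HEADER = 'contains(concat(" ", normalize-space(@class), " "), " nob-border ")'


def rows_between_headers(root):
    """Return one list of rows per nob-border row, in document order."""
    groups = []
    for tr in root.xpath('.//tr[%s]' % HEADER):
        # Step 1: position of the current header row among its tr siblings.
        current_pos = tr.xpath('count(preceding-sibling::tr) + 1')
        next_header = tr.xpath('following-sibling::tr[%s][1]' % HEADER)
        if next_header:
            # Step 2: position of the next header row.
            next_pos = next_header[0].xpath('count(preceding-sibling::tr) + 1')
            # Step 3: following siblings strictly before the next header row.
            rows = tr.xpath('following-sibling::tr[position() < $offset]',
                            offset=next_pos - current_pos)
        else:
            # Assumption: the last header row owns every remaining row.
            rows = tr.xpath('following-sibling::tr')
        groups.append(rows)
    return groups


groups = rows_between_headers(html.fromstring(SAMPLE))
xeids = [[row.get("xeid") for row in rows] for rows in groups]
print(xeids)
```

The dummy row has no xeid attribute, so it shows up as None in the first group; filter on row.get("xeid") if you only want the game rows.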

Here's an alternative solution using the "sets" EXSLT extensions:

for tr in subpage.cssselect('tr.nob-border'):
    # non-header rows after this one, minus every row that follows a later header row
    tr_in_between = tr.xpath("""set:difference(
            following-sibling::tr[not(contains(@class, "nob-border"))],
            following-sibling::tr[contains(@class, "nob-border")]/following-sibling::tr)""",
        namespaces={"set": "http://exslt.org/sets"})
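As a rough, runnable illustration of the set:difference variant, the sketch below keys each group by the text of its header row, which is one way to pick out only the regular-season rows. The sample markup, the section titles and the helper name rows_after are hypothetical, and it assumes your lxml build exposes the EXSLT sets functions to XPath once the namespace is passed, as the snippet above relies on.

```python
from lxml import html

# Hypothetical sample; the section titles are invented for the example.
SAMPLE = """
<table>
  <tr class="center nob-border"><td>Regular season</td></tr>
  <tr class="table-dummyrow"><td></td></tr>
  <tr class="odd deactivate" xeid="IqLK6ZNT"><td></td></tr>
  <tr class=" deactivate" xeid="l0Xo8yvB"><td></td></tr>
  <tr class="center nob-border"><td>Play-offs</td></tr>
  <tr class="odd deactivate" xeid="QLnrBc9b"><td></td></tr>
</table>
"""

NS = {"set": "http://exslt.org/sets"}


def rows_after(header):
    """Rows after `header`, excluding headers and anything past the next header."""
    return header.xpath(
        """set:difference(
               following-sibling::tr[not(contains(@class, "nob-border"))],
               following-sibling::tr[contains(@class, "nob-border")]
                                  /following-sibling::tr)""",
        namespaces=NS)


root = html.fromstring(SAMPLE)
headers = root.xpath('.//tr[contains(@class, "nob-border")]')

# Map each header's text to the xeid values of its game rows,
# skipping rows without an xeid (such as the dummy row).
sections = {
    header.text_content().strip(): [
        row.get("xeid") for row in rows_after(header) if row.get("xeid")
    ]
    for header in headers
}
print(sections["Regular season"])
```

For the last header row the second node-set is empty, so the difference simply returns every remaining non-header row.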

0

This leans more on Python and leaves cssselect earlier:

>>> trs = subpage.cssselect('tr')
>>> for prev, curr, nxt in zip(trs, trs[1:], trs[2:]):
...     if curr.cssselect('.nob-border'):
...         print(prev, curr, nxt)
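The loop above only looks at the immediate neighbours of each nob-border row. If the goal is the whole run of rows under each header, the same "do it in Python" idea can be sketched as a single pass that starts a new group at every header row. This is only a sketch: the sample table and the name group_rows are made up for illustration, and it iterates with table.iter("tr") rather than cssselect so that it runs with lxml alone.

```python
from lxml import html

# Hypothetical sample modelled on the rows shown in the question.
SAMPLE = """
<table>
  <tr class="center nob-border"><td>Regular season</td></tr>
  <tr class="table-dummyrow"><td></td></tr>
  <tr class="odd deactivate" xeid="IqLK6ZNT"><td></td></tr>
  <tr class=" deactivate" xeid="l0Xo8yvB"><td></td></tr>
  <tr class="center nob-border"><td>Play-offs</td></tr>
  <tr class="odd deactivate" xeid="QLnrBc9b"><td></td></tr>
</table>
"""


def group_rows(rows):
    """Split a flat sequence of <tr> elements into (header, rows) pairs."""
    groups = []
    for row in rows:
        classes = (row.get("class") or "").split()
        if "nob-border" in classes:
            # A header row opens a new, initially empty group.
            groups.append((row, []))
        elif groups:
            # Any other row belongs to the most recent header;
            # rows before the first header are ignored.
            groups[-1][1].append(row)
    return groups


table = html.fromstring(SAMPLE)
groups = group_rows(table.iter("tr"))
summary = [
    (header.text_content().strip(), [row.get("xeid") for row in rows])
    for header, rows in groups
]
print(summary)
```

Because the grouping happens in Python, the last header row needs no special case: it collects whatever rows remain.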
