I have been using Selenium and Python to scrape a webpage and I am having difficulty collecting data that I want out of a div that has the following structure:

<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text</span>
  </div>
</div>

The div has a number of rows, each with 2 columns containing the data/text inside of span tags. There are no CSS ids.

I'm only interested in collecting the text contained within the 'MainGridcolumn2' span classes.

I've tried the following to navigate to the first heading, intending to then use 'following-sibling' to move to the next span tag containing the text. I can't even get this first step to work: it returns no text when I print it to the console:

driver.find_element_by_xpath("//span['@class=MainGridcolumn1'][contains(text(), 'Heading1')]").text

and

driver.find_element_by_xpath("//span[contains(text(), 'Heading1')]").text
  • Hi, the MainGridcolumn1 spans hold headings that never change, but the values in MainGridcolumn2 are always different, and the order of the rows can change from page to page. So I was trying to navigate to a heading span and then move to the following span to collect its value. Commented Jul 3, 2016 at 21:33
  • Are there other "MainGridRow" divs? Commented Jul 3, 2016 at 21:35
  • Yes - about 20 in total Commented Jul 3, 2016 at 21:35
  • What about div class="col span_6" ? Commented Jul 3, 2016 at 21:46
  • Just curious, why are you using selenium to scrape a web page? Selenium is designed for testing. Its XPath implementation also isn't (or didn't use to be) very robust -- it would come up against limitations if you tried to push it too far. Would you consider an alternative, like BeautifulSoup? Commented Jul 4, 2016 at 2:53

1 Answer

One way would be to get the enclosing div, i.e. the grandparent of the heading span, and pull the spans from that:

h = """<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text</span>
  </div>
</div>

  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text I don't want</span>
  </div>"""

from lxml import html

xm = html.fromstring(h)
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/../..")[0]
print(div.xpath(".//span[@class='MainGridcolumn2']/text()"))

Which would give you:

['Text that I want', 'More text that I want', 'Even more text', 'Piece of text']

You could also select just the parent row and take the spans from it and from its following sibling rows:

from lxml import html

xm = html.fromstring(h)
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/..")[0]
# Union of the row's own value span and the value spans of the rows after it.
print(div.xpath("./span[@class='MainGridcolumn2']/text() | ./following-sibling::div/span[@class='MainGridcolumn2']/text()"))
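
Since the comments on the question say the headings are fixed while the row order varies between pages, it may be more convenient to look values up by heading. The following is only a sketch under that assumption: the helper names value_for and values_by_heading and the trimmed SAMPLE markup are made up for illustration and are not part of the original answer.

```python
from lxml import html

# Trimmed copy of the markup from the question, used only for illustration.
SAMPLE = """<div class="col span_6">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
</div>"""

# The $heading variable is filled in by lxml, so the heading text never
# has to be spliced into the expression by hand.
VALUE_XPATH = (
    "//span[@class='MainGridcolumn1'][normalize-space()=$heading]"
    "/following-sibling::span[@class='MainGridcolumn2'][1]/text()"
)


def value_for(tree, heading):
    """Return the column-2 text next to the given heading, or None."""
    hits = tree.xpath(VALUE_XPATH, heading=heading)
    return hits[0].strip() if hits else None


def values_by_heading(tree):
    """Map every row's heading text to its value text."""
    pairs = {}
    for row in tree.xpath("//div[@class='MainGridRow']"):
        heading = row.xpath("string(span[@class='MainGridcolumn1'])").strip()
        value = row.xpath("string(span[@class='MainGridcolumn2'])").strip()
        if heading:
            pairs[heading] = value
    return pairs


tree = html.fromstring(SAMPLE)
print(value_for(tree, "Another heading"))
print(values_by_heading(tree))
```

With Selenium, the string handed to html.fromstring could be driver.page_source instead of SAMPLE.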

4 Comments

Thanks for this. I originally passed the entire page source to "h", but that seemed to be too big and Python threw an error, so I just needed to navigate down to this particular div and then use the rest of your code.
Off the top of my head I can't remember, but I think it was something about a string being too long?
Interesting. I could see that happening if you mistakenly used parse, but I have not seen an error like that with fromstring, unless you run out of memory.
Ahh yes you are probably right, I think I might have done that!
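
For reference, the difference the comments above point at is that lxml.html.fromstring takes the markup itself, while lxml.html.parse expects a filename, URL, or file-like object. A minimal sketch of both calls on the same markup follows; the markup string is a made-up one-row sample, and wrapping it in io.BytesIO is just one way to hand it to parse.

```python
import io
from lxml import html

# Made-up one-row sample in the shape of the question's markup.
markup = '<div class="MainGridRow"><span class="MainGridcolumn2">Piece of text</span></div>'

# fromstring() takes the markup itself.
from_string = html.fromstring(markup)

# parse() takes a filename, URL, or file-like object, so the markup is
# wrapped in an in-memory file here; getroot() returns the <html> element.
from_file = html.parse(io.BytesIO(markup.encode("utf-8"))).getroot()

query = "//span[@class='MainGridcolumn2']/text()"
print(from_string.xpath(query))
print(from_file.xpath(query))
```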
