How can I collect this data from a div using Selenium and Python

Question

I have been using Selenium and Python to scrape a webpage and I am having difficulty collecting data that I want out of a div that has the following structure:

<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text</span>
  </div>
</div>

The div has a number of rows, each with 2 columns containing the data/text inside of span tags. There are no CSS ids.

I'm only interested in collecting the text contained within the 'MainGridcolumn2' span classes.

I've tried the following to navigate to the first heading, intending to then use 'following-sibling' to move to the next span tag containing the text, but I can't even get this far, as no text is returned when I print it to the console:

driver.find_element_by_xpath("//span['@class=MainGridcolumn1'][contains(text(), 'Heading1')]").text

and

driver.find_element_by_xpath("//span[contains(text(), 'Heading1')]").text

Hi, the texts in MainGridcolumn1 are headings that never change, but the values in MainGridcolumn2 are always different, and the order of the data can change on different pages. So I was trying to navigate to the heading spans and then move to the following span to collect the value.
– Matt, Commented Jul 3, 2016 at 21:33
Just curious, why are you using Selenium to scrape a web page? Selenium is designed for testing. Its XPath implementation also isn't (or didn't use to be) very robust: it would come up against limitations if you tried to push it too far. Would you consider an alternative, like BeautifulSoup?
– LarsH, Commented Jul 4, 2016 at 2:53

Padraic Cunningham · Accepted Answer · 2016-07-03 22:05:47Z

One way would be to get the enclosing div, i.e. the grandparent, and pull the spans from that:

h = """<div class="col span_6" style="margin-left: 12px;width: 47% !important;">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Next heading</span>
    <span class="MainGridcolumn2">Even more text</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text</span>
  </div>
</div>

  <div class="MainGridRow">
    <span class="MainGridcolumn1">Yet another heading</span>
    <span class="MainGridcolumn2">Piece of text I don't want</span>
  </div>"""

from lxml import html

xm = html.fromstring(h)
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/../..")[0]
print(div.xpath(".//span[@class='MainGridcolumn2']/text()"))

Which would give you:

['Text that I want', 'More text that I want', 'Even more text', 'Piece of text']

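You could also just select the parent and get the parent's following siblings (this only picks up rows that come after the matched heading's row):

from lxml import html

xm = html.fromstring(h)
# Parent row of the matched heading span.
div = xm.xpath("//span[@class='MainGridcolumn1'][contains(text(), 'Heading1')]/..")[0]
# Value in this row, plus the values in the sibling rows that follow it.
print(div.xpath(".//span[@class='MainGridcolumn2']/text() | ./following-sibling::div/span[@class='MainGridcolumn2']/text()"))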
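Since the headings are fixed but their order can vary between pages, another option is to look up each value by its heading with the following-sibling axis. The sketch below is only an illustration: `value_for_heading` and `SNIPPET` are hypothetical names, the markup is a shortened copy of the question's fragment, and it assumes each heading appears once in the parsed markup.

```python
from lxml import html

# Shortened copy of the fragment from the question (two rows only).
SNIPPET = """
<div class="col span_6">
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Heading1</span>
    <span class="MainGridcolumn2">Text that I want</span>
  </div>
  <div class="MainGridRow">
    <span class="MainGridcolumn1">Another heading</span>
    <span class="MainGridcolumn2">More text that I want</span>
  </div>
</div>
"""


def value_for_heading(root, heading):
    """Return the column-2 text next to the given column-1 heading, or None."""
    # $heading is an XPath variable; lxml binds it from the keyword argument.
    matches = root.xpath(
        "//span[@class='MainGridcolumn1'][normalize-space()=$heading]"
        "/following-sibling::span[@class='MainGridcolumn2'][1]/text()",
        heading=heading,
    )
    return matches[0].strip() if matches else None


root = html.fromstring(SNIPPET)
print(value_for_heading(root, "Another heading"))
print(value_for_heading(root, "Heading1"))
```

If you stay in Selenium instead, note that its XPath locators have to select elements rather than text nodes, so the trailing /text() step would be dropped and the element's .text read instead.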

edited Jul 3, 2016 at 22:05

answered Jul 3, 2016 at 21:50

4 Comments

Matt Over a year ago

Thanks for this. I originally passed the entire page source to "h", but that seemed to be too big and Python threw an error, so I just needed to navigate down to this particular div and then use the rest of your code.

Matt Over a year ago

Off the top of my head I can't remember, but I think it was something about a string being too long?

Padraic Cunningham Over a year ago

Interesting. I could see how that would happen if you mistakenly used parse, but I have not seen an error like that using fromstring, bar you running out of memory.

Matt Over a year ago

Ahh yes you are probably right, I think I might have done that!
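For anyone hitting the same thing: lxml's html.fromstring takes the markup itself, while html.parse expects a filename, URL or file-like object, so a long HTML string handed to parse is treated as a path. A minimal sketch of the two calls, assuming `page_source` stands in for the string you would get from Selenium's driver.page_source (the markup and the `QUERY` name are made up for illustration):

```python
from io import StringIO

from lxml import html

# Stand-in for driver.page_source (hypothetical, shortened markup).
page_source = """<html><body>
<div class="MainGridRow">
  <span class="MainGridcolumn1">Heading1</span>
  <span class="MainGridcolumn2">Text that I want</span>
</div>
</body></html>"""

QUERY = "//span[@class='MainGridcolumn2']/text()"

# fromstring() takes the markup itself.
from_text = html.fromstring(page_source).xpath(QUERY)

# parse() takes a filename, URL or file-like object, so wrap the string.
from_file_like = html.parse(StringIO(page_source)).xpath(QUERY)

print(from_text)
print(from_file_like)
```

Both calls should find the same span text; the only difference is what kind of argument each function accepts.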
