Web parsing using Selenium and classes

Question

I am trying to parse several items from a blog, but I am unable to reach the last two items I need.

The HTML is:

        <div class="post">
            <div class="postHeader">
                <h2 class="postTitle"><span></span><a href="http://website.com" title="cuba and the cameraman">cuba and the cameraman</a></h2>
                <span class="postMonth" title="2017">Nov</span>
                <span class="postDay" title="2017">24</span>
                <div class="postSubTitle"><span class="postCategories"><a href="http://website.com" rel="category tag">TV Shows</a></span></div>
            </div>
            <div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a>&nbsp;<br />
n/A<br />
&nbsp;<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
    &nbsp;</p>

The data I need are the title "cuba and the cameraman" (which I can already get, see the code below), the image URL "https://image.com/test.jpg" and the IMDB link "http://www.imdb.com/title/tt7320560/".

So far I have only managed to parse the post titles (postTitle) for the website:

    all_titles = []
    url = 'http://test.com'
    browser.get(url)
    titles = browser.find_elements_by_class_name('postHeader')
    for title in titles:
        link = title.find_element_by_tag_name('a')
        all_titles.append(link.text)

But I can't get the image and IMDB links using the same method as above (by class name). Could you help me with this? Thanks.

CtheSky · Accepted Answer · 2017-11-24 14:32:05Z

1

You need a more precise search. Selenium has a family of built-in find_element_by_XX functions; try XPath:

    for post in driver.find_elements_by_xpath('//div[@class="post"]'):
        # The leading "." makes each XPath relative to the current post.
        title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
        img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
        # a[last()] picks the last <a> under its parent, here the IMDB link.
        link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')
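
On recent Selenium 4 releases the find_element_by_* helpers are no longer available, and the same lookups go through find_element / find_elements with a locator strategy. A minimal sketch of the loop above in that style, assuming `driver` is an already-started WebDriver. The function name scrape_posts and the dict keys are only illustrative, and the strategy is passed as the plain string "xpath" (the value of By.XPATH in selenium.webdriver.common.by) so the function needs no import of its own:

```python
# Same string constant as selenium.webdriver.common.by.By.XPATH.
XPATH = "xpath"


def scrape_posts(driver):
    """Return one dict per post: title, image URL and last link in the content.

    `driver` is assumed to be a Selenium WebDriver that has already loaded
    the page; the XPath expressions are the ones from the answer.
    """
    results = []
    for post in driver.find_elements(XPATH, '//div[@class="post"]'):
        # The leading "." keeps each lookup inside the current post.
        title = post.find_element(XPATH, './/h2[@class="postTitle"]//a').text
        img_src = post.find_element(
            XPATH, './/div[@class="postContent"]//img'
        ).get_attribute("src")
        link = post.find_element(
            XPATH, './/div[@class="postContent"]//a[last()]'
        ).get_attribute("href")
        results.append({"title": title, "image": img_src, "link": link})
    return results
```

Typical use would be something like driver.get(url) followed by scrape_posts(driver), collecting the returned dicts into a list.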

Remember that you can always get the HTML source from driver.page_source and parse it with whatever tool you like.
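
As a rough illustration of that second route, here is a sketch that parses HTML like the snippet in the question with Python's standard html.parser module instead of Selenium lookups. The class name PostParser and the keys of the collected dicts are made up for this example, and in a real run the string fed to the parser would come from driver.page_source:

```python
from html.parser import HTMLParser


class PostParser(HTMLParser):
    """Collect the title, first image URL and content links of each post."""

    def __init__(self):
        super().__init__()
        self.posts = []              # one dict per <div class="post">
        self._in_title = False       # inside <h2 class="postTitle">
        self._in_title_link = False  # inside the <a> of the title
        self._in_content = False     # after <div class="postContent">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "post" in classes:
            # A new post starts; anything that follows belongs to it.
            self.posts.append({"title": "", "image": None, "links": []})
            self._in_content = False
        elif not self.posts:
            return
        elif tag == "h2" and "postTitle" in classes:
            self._in_title = True
        elif tag == "div" and "postContent" in classes:
            # Simplification: content is assumed to run until the next post.
            self._in_content = True
        elif tag == "a" and self._in_title:
            self._in_title_link = True
        elif tag == "a" and self._in_content and "href" in attrs:
            self.posts[-1]["links"].append(attrs["href"])
        elif tag == "img" and self._in_content and self.posts[-1]["image"] is None:
            self.posts[-1]["image"] = attrs.get("src")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False
        elif tag == "a":
            self._in_title_link = False

    def handle_data(self, data):
        if self._in_title_link:
            self.posts[-1]["title"] += data


# Stand-in for driver.page_source, shortened from the HTML in the question.
sample = """
<div class="post">
  <div class="postHeader">
    <h2 class="postTitle"><span></span><a href="http://website.com">cuba and the cameraman</a></h2>
    <div class="postSubTitle"><a href="http://website.com" rel="category tag">TV Shows</a></div>
  </div>
  <div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a>&nbsp;<br />
  <strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br /></p>
"""

parser = PostParser()
parser.feed(sample)
first = parser.posts[0]
print(first["title"], first["image"], first["links"][-1])
```

Here the IMDB link is simply the last link collected for the post. A dedicated parser such as BeautifulSoup or lxml would likely be more robust on messy markup.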

edited Nov 24, 2017 at 14:32

answered Nov 24, 2017 at 12:24


2 Comments

Gonzalo Over a year ago

Thanks. Two questions: you mean attribute instead of property, right? And instead of a[last()] for the last link, what if I wanted the second one? I just noticed that the <strong>Links:</strong> section has 3 links, and I need only the second. Thanks again.

CtheSky Over a year ago

Yes, get_attribute is more accurate. To select the second link, use a number instead of last(), for example .//div[@class="postContent"]//a[2]
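
One thing to watch with a numeric index: like last(), it counts <a> elements among the children of the same parent, so an image link that sits in the same <p> is counted too. A small check with Python's xml.etree.ElementTree, whose limited XPath subset handles these predicates the same way. The markup below is a hypothetical stand-in with three links after "Links:", not the real page, and the example.com URLs are placeholders:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a post with three links after "Links:".
SNIPPET = """
<div class="post">
  <div class="postContent">
    <p>
      <a href="https://image.com/test.jpg"><img src="https://image.com/test.jpg"/></a>
      <strong>Links:</strong>
      <a href="http://example.com/first">First</a>
      <a href="http://example.com/second">Second</a>
      <a href="http://example.com/third">Third</a>
    </p>
  </div>
</div>
"""

root = ET.fromstring(SNIPPET)
content = './/div[@class="postContent"]'


def href(path):
    """Return the href of the first element matching the given path."""
    return root.find(path).get("href")


# Positions are counted among the <a> children of the shared <p>,
# so the image anchor occupies position 1.
second_anchor = href(content + "//a[2]")
third_anchor = href(content + "//a[3]")
last_anchor = href(content + "//a[last()]")

print(second_anchor)
print(third_anchor)
print(last_anchor)
```

With this layout a[2] is the first link after "Links:" and the second one is a[3]; the right index depends on how the actual page nests its anchors.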
