I am trying to parse several items from a blog, but I am unable to reach the last two items I need.

The HTML is:

        <div class="post">
            <div class="postHeader">
                <h2 class="postTitle"><span></span><a href="http://website.com" title="cuba and the cameraman">cuba and the cameraman</a></h2>
                <span class="postMonth" title="2017">Nov</span>
                <span class="postDay" title="2017">24</span>
                <div class="postSubTitle"><span class="postCategories"><a href="http://website.com" rel="category tag">TV Shows</a></span></div>
            </div>
            <div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a>&nbsp;<br />
n/A<br />
&nbsp;<br />
<strong>Links:</strong> <a target='_blank' href='http://www.imdb.com/title/tt7320560/'>IMDB</a><br />
    &nbsp;</p>

The data I need are the title "cuba and the cameraman" (code below), the image URL "https://image.com/test.jpg" and the IMDB link "http://www.imdb.com/title/tt7320560/".

So far I have only managed to parse the post titles (postTitle) for the website:

    all_titles = []
    url = 'http://test.com'
    browser.get(url)
    titles = browser.find_elements_by_class_name('postHeader')
    for title in titles:
        link = title.find_element_by_tag_name('a')
        all_titles.append(link.text)

But I can't get the image and IMDB links using the same method as above (class name). Could you help me with this? Thanks.

1 Answer

You need a more precise search. Selenium has a family of find_element_by_* functions built in; try XPath:

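    # driver is the WebDriver (called browser in the question). Selenium 4
    # replaces these calls with find_element(By.XPATH, ...).
    for post in driver.find_elements_by_xpath('//div[@class="post"]'):
        # The leading "." keeps each search inside the current post.
        title = post.find_element_by_xpath('.//h2[@class="postTitle"]//a').text
        img_src = post.find_element_by_xpath('.//div[@class="postContent"]//img').get_attribute('src')
        # a[last()] matches the last <a> among its siblings, here the IMDB link.
        link = post.find_element_by_xpath('.//div[@class="postContent"]//a[last()]').get_attribute('href')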

Remember that you can always get the HTML source from driver.page_source and parse it with whatever tool you like.
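
As a rough sketch of that second route, the same three XPath lookups can be run on the page source with Python's standard-library xml.etree.ElementTree. The PAGE_SOURCE string below is a simplified stand-in for driver.page_source, and the extract_posts helper and the dict keys are names made up for this illustration. ElementTree only accepts well-formed XML, so this assumes the markup has been cleaned up (no &nbsp; entities, all tags closed); real pages often need a more lenient HTML parser.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for driver.page_source (assumption: well-formed markup,
# with the &nbsp; entities from the original page left out).
PAGE_SOURCE = """
<html><body>
<div class="post">
  <div class="postHeader">
    <h2 class="postTitle"><span></span><a href="http://website.com" title="cuba and the cameraman">cuba and the cameraman</a></h2>
  </div>
  <div class="postContent"><p><a target="_blank" href="https://image.com/test.jpg"><img class="aligncenter" src="https://image.com/test.jpg"/></a><br />
n/A<br />
<strong>Links:</strong> <a target="_blank" href="http://www.imdb.com/title/tt7320560/">IMDB</a><br />
  </p></div>
</div>
</body></html>
"""


def extract_posts(page_source):
    """Return one dict per post with its title, image URL and last link."""
    root = ET.fromstring(page_source)
    posts = []
    for post in root.iterfind('.//div[@class="post"]'):
        # Same relative lookups as in the Selenium loop, scoped to this post.
        title = post.find('.//h2[@class="postTitle"]/a').text
        img_src = post.find('.//div[@class="postContent"]//img').get("src")
        link = post.find('.//div[@class="postContent"]//a[last()]').get("href")
        posts.append({"title": title, "image": img_src, "imdb": link})
    return posts


posts = extract_posts(PAGE_SOURCE)
```

With a live browser session, the string passed to extract_posts would be driver.page_source rather than the hard-coded sample.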

2 Comments

Thanks. Two questions: you mean attribute instead of property, right? And instead of a[last()] for the last link, what if I wanted the second one? I just noticed that <strong>Links:</strong> has 3 links, and I need only the second. Thanks again.
Yes, get_attribute is more accurate. To select the second link, use a number instead of last(), e.g. .//div[@class="postContent"]//a[2]
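
One thing to watch with numeric positions: in XPath, a[2] means "the second <a> child of its parent", so every <a> under the same parent is counted. In the question's HTML the image link sits in the same <p> as the links after "Links:", which would make it a[1]. The sketch below illustrates this with the standard-library xml.etree.ElementTree; the three example.com links are placeholders (the real page with three links is not shown), and href_at is a helper invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical post body: one image link followed by three links after
# "Links:". The example.com URLs are placeholders, not from the real page.
POST_CONTENT = """
<div class="postContent"><p>
  <a href="https://image.com/test.jpg"><img src="https://image.com/test.jpg"/></a><br />
  <strong>Links:</strong>
  <a href="http://example.com/first">First</a>
  <a href="http://example.com/second">Second</a>
  <a href="http://example.com/third">Third</a>
</p></div>
"""

content = ET.fromstring(POST_CONTENT)


def href_at(position):
    """Return the href of the <a> matched by .//a[position]."""
    return content.find(f".//a[{position}]").get("href")


image_link = href_at(1)          # the image link is the first <a> in the <p>
first_listed = href_at(2)        # first link after "Links:"
second_listed = href_at(3)       # second link after "Links:" is a[3] here
last_listed = href_at("last()")  # last <a> under the <p>
```

Under this layout the second link after "Links:" is a[3], not a[2]. If the real page wraps the links differently, the index shifts accordingly, so it is worth checking the actual markup first.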
