After having created a few different spiders I thought I could scrape practically anything, but I've hit a roadblock.

Given the following code snippet:

<div class="col-md-4">
    <div class="tab-title">Homepage</div>
    <p>
        <a target="_blank" rel="nofollow" 
         href="http://www.bitcoin.org">http://www.bitcoin.org
        </a>
    </p>
</div>

How would you select the link inside the <a>...</a> element based on the text within the tab-title div?

I need that condition because several other links also match this selector:

response.css('div.col-md-4 a::attr(href)').extract()

My best guess is the following:

response.css('div.col-md-4 div.tab-title:contains("Homepage") a::attr(href)').extract()

Any insights are appreciated! Thank you in advance.

Note: I am using Scrapy.

1 Answer

How about this using XPath:

response.xpath('//div[@class="tab-title" and contains(., "Homepage")]/..//a/@href').extract()

Find a div with class tab-title whose text contains Homepage, step up to its parent, then look for an a element at any depth below that parent.

EDIT: Using CSS, you should be able to do it like this:

response.css('div.tab-title:contains("Homepage") ~ * a::attr(href)').extract()

4 Comments

Thank you. Can you "step up" using CSS selectors, or is that functionality limited to XPath selectors?
Very nice, I was unaware of the ~ and * symbols for CSS selectors. If you don't mind, would you briefly explain what they do in this example?
Basically, it selects every element that is preceded, as a sibling, by a div with class tab-title containing the string Homepage. Check out the reference at W3Schools.
I was referring to the symbols, but thank you regardless.
