How to use a CSS selector to extract URLs with Python's Scrapy?

Question

In order to learn Scrapy, I am crawling all the elements of this website: http://quotes.toscrape.com/random

However, I do not understand how to extract the URL of the author's bio. I tried the CSS selector:

>>> response.css('a::attr(href)').extract()
['/', '/login', '/author/Ralph-Waldo-Emerson', '/tag/life/page/1/', '/tag/regrets/page/1/', 'https://www.goodreads.com/quotes', 'https://scrapinghub.com']

Then:

>>> response.css('small.quote>span>a::attr(href)').extract()

However, this does not return the author's bio URL. How can I get that URL with a CSS selector?

UPDATE

I already know that I can do:

response.css('a::attr(href)').extract()[2]

However, I suspect this is not robust, since it depends on the position of the link in the page. Is there a better way to get the bio link?

JkShaw · Accepted Answer · 2017-04-24 18:30:56Z

This might work:

>>> import os
>>> os.path.dirname(response.url)
'http://quotes.toscrape.com'

>>> response.css('a::attr(href)').extract()[2]
u'/author/Bob-Marley'

>>> os.path.dirname(response.url) + response.css('a::attr(href)').extract()[2]
u'http://quotes.toscrape.com/author/Bob-Marley'
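
Concatenating `os.path.dirname(response.url)` with the href only works when the href starts with `/` and the page sits one level below the site root. A more general way to build the absolute URL is probably `urljoin` from the standard library, sketched below. The helper name `absolute_url` and the sample hrefs are only for illustration; the page URL is the one from the question.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve an href found on page_url into an absolute URL.

    Hypothetical helper for illustration; it simply wraps urljoin.
    """
    return urljoin(page_url, href)


page_url = "http://quotes.toscrape.com/random"

# A root-relative link, like the author bio link in the question.
print(absolute_url(page_url, "/author/Bob-Marley"))

# A link that is already absolute is returned unchanged.
print(absolute_url(page_url, "https://www.goodreads.com/quotes"))
```

Inside the Scrapy shell, `response.urljoin(href)` does the same resolution against the response's own URL. To avoid the positional `[2]`, a selector tied to the author element may be more robust, for example `response.css('.author + a::attr(href)').extract_first()`. This assumes the page places the "(about)" link immediately after the `small` element with class `author`, which should be checked against the actual markup.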

