Is it possible with Scrapy to combine XPath and CSS selectors in an Item Loader?

I admit, until now, I've avoided Item Loaders for simplicity, but I'm at the point now where I feel I need them for maintainability.

To date, I've been chaining XPath and CSS selectors together for some of my selectors, such as sel.xpath('.//td[@class="desc"]').css('.title'). I do this because there is a mixture of additional classes alongside title, or the spacing around title is uneven (also, it's the recommended way in the documentation).

With a loader, I'm only seeing a .add_xpath() method and a separate .add_css() method. Is there a "proper" way to do this?

1 Answer

In general, we try to avoid mixing XPath expressions with CSS selectors and this is usually quite easy to achieve. But, if you want to use Item Loaders and, at the same time, mix XPath and CSS, you would need to use what ItemLoader uses internally.

Something along these lines:

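from scrapy.loader import ItemLoader
from scrapy.utils.misc import arg_to_iter
from scrapy.utils.python import flatten

class MyItemLoader(ItemLoader):
    def add_xpath_and_css(self, field_name, xpaths, csss, *processors, **kw):
        # accept either a single expression or a list of them;
        # a bare string would otherwise be iterated character by character
        xpaths, csss = arg_to_iter(xpaths), arg_to_iter(csss)

        # get the xpath results first
        xpath_results = flatten([self.selector.xpath(xpath) for xpath in xpaths])

        # for every xpath result apply a css selector
        values = flatten([xpath_result.css(css).extract() for xpath_result in xpath_results for css in csss])

        # hand the extracted strings to the loader so processors still run
        self.add_value(field_name, values, *processors, **kw)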

5 Comments

I don't think this would have the same effect as chaining together selectors. For instance, sel.xpath('somexpath').css('somecss').xpath('morexpath'). With this differing behaviour, I think it would be counterproductive to my objective of increasing maintainability. Looks like I'll have to just avoid it altogether.
@Rejected yeah, it only covers the xpath->css case (not tested though), but you can think about improving it. Note that you can always use the selector directly, loader.selector.xpath('somexpath').css('somecss')..., and then use add_value() to add the extracted values to the item loader instance.
Would there be any immediate downside using the .add_value('field', loader.selector....) method over using the .add_xpath or .add_css methods? If not, it looks like that's a more versatile option over both.
@Rejected not that I can think of off the top of my head. If you use add_value() you'll still have the processors applied. add_css() and add_xpath() basically call self.selector.css and self.selector.xpath respectively and then call add_value(). Hope that helps.
It's a huge help. I'll test it out and see how it works. If I don't run into any issues, then you've answered exactly what I was asking for.
