
I don't want to hard-code the XPath and CSS selectors for my items in the spider.

Instead, I want to save them somewhere and load them dynamically when the spider runs.

Is there any official support for this feature, please?

What I have tried

I made a dictionary where the key is the item name and the value is a two-element list: the first element is the XPath expression and the second is the CSS selector.
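A minimal sketch of that dictionary (the field names and expressions below are illustrative, not from a real project):

```python
# Each field maps to [xpath_expression, css_expression_or_None].
SELECTORS = {
    "title": ["//ul/li/a/text()", None],    # XPath only
    "link": ["//ul/li/a", "::attr(href)"],  # XPath, then CSS
}

def apply_selectors(sel, field_name):
    """Apply the stored XPath and (optional) CSS expression for a field."""
    xpath_expr, css_expr = SELECTORS[field_name]
    out = sel.xpath(xpath_expr)
    if css_expr is not None:
        out = out.css(css_expr)
    return out
```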

  • "key is the item name": you mean item field name? Why 2 values, XPath and CSS? I'm not sure what you want to achieve. Tell us more. Commented Jan 31, 2014 at 9:02
  • @pault. I am trying to save the CSS and XPath values in a class and call them later with sel, like this: sel.xpath(valueofxpath).css(valueofcss) Commented Jan 31, 2014 at 13:53
  • I guess you could pass a JSON string as parameter to your spider. Commented Jan 31, 2014 at 14:10
  • @pault. kindly give me an example in an answer Commented Jan 31, 2014 at 15:15

1 Answer


You can define a config file similar to this (borrowing from the Scrapy tutorial):

{
    "region": [["xpath", "//ul/li"]],
    "fields": {
        "title": [["xpath", "a/text()"]],
        "link": [["xpath", "./a"], ["css", "::attr(href)"]],
        "desc": [["xpath", "text()"]]
    }
}

Each list represents the consecutive selector expressions to apply, each one either an XPath expression or a CSS selector. For example:

  • [["xpath", "//ul/li"]] means "apply the XPath expression //ul/li", i.e. sel.xpath('//ul/li')
  • [["xpath", "./a"], ["css", "::attr(href)"]] means "apply the XPath expression ./a, then the CSS selector ::attr(href)" (note: ::attr() is non-standard, it's a Scrapy extension), equivalent to sel.xpath('./a').css('::attr(href)')
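To make the chaining concrete, here is a small self-contained sketch; FakeSelector is a hypothetical stand-in for Scrapy's Selector that just records the calls made on it:

```python
class FakeSelector:
    """Stand-in for Scrapy's Selector that records each chained call."""
    def __init__(self, trail=()):
        self.trail = tuple(trail)
    def xpath(self, expr):
        return FakeSelector(self.trail + (("xpath", expr),))
    def css(self, expr):
        return FakeSelector(self.trail + (("css", expr),))

def apply_exp_list(selector, expression_list):
    """Apply a list of (type, expression) pairs to a selector, in order."""
    out = selector
    for expr_type, expr_val in expression_list:
        out = getattr(out, expr_type)(expr_val)
    return out

chain = apply_exp_list(FakeSelector(), [["xpath", "./a"], ["css", "::attr(href)"]])
print(chain.trail)  # (('xpath', './a'), ('css', '::attr(href)'))
```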

I added a "region" key to the config so that the field selectors are applied within a specific region of the page.

You can pass a JSON string as an argument to your spider (-a argname=argvalue), and the argument is then available as an attribute of your spider -- self.selconfig in this case.

Spider code:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field
import json
import pprint

def apply_exp_list(selector, expression_list):
    """Apply a chain of ("xpath"|"css", expression) pairs to a selector."""
    out = selector
    for expr_type, expr_val in expression_list:
        if expr_type == "xpath":
            out = out.xpath(expr_val)
        elif expr_type == "css":
            out = out.css(expr_val)
    return out

class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
       "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)

        config = json.loads(self.selconfig)
        self.log("selector configuration: \n%s" % pprint.pformat(config))

        regions = apply_exp_list(sel, config["region"])
        items = []
        for region in regions:
            item = DmozItem()
            for field_name, exp_list in config["fields"].items():
                item[field_name] = apply_exp_list(region, exp_list).extract()
            items.append(item)
        return items

And on the command line, for example:

paul@wheezy:~/tmp/stackoverflow$ scrapy runspider 21474657.py \
-a selconfig='{"fields": {"desc": [["xpath", "text()"]], "link": [["xpath", "./a"], ["css", "::attr(href)"]], "title": [["xpath", "a/text()"]]}, "region": [["xpath", "//ul/li"]]}'

Note: I played around with this a bit; it turned out somewhat more complex than expected, and it reminded me of (disclaimer: my own) parslepy project.
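If the JSON string gets unwieldy on the command line, an alternative (a hedged sketch, not part of the original answer) is to keep the configuration in a file and load it when the spider starts; a minimal file-loading helper could look like this:

```python
import json
import os
import tempfile

def load_selconfig(path):
    """Load a selector-configuration JSON file into a dict.

    Hypothetical helper: a spider could call this in __init__
    (e.g. with -a selconfig_file=selectors.json) instead of taking
    the whole JSON string on the command line.
    """
    with open(path) as f:
        return json.load(f)

# Quick round-trip demonstration with a temporary file.
config = {"region": [["xpath", "//ul/li"]]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump(config, tmp)
    path = tmp.name
loaded = load_selconfig(path)
os.unlink(path)
print(loaded == config)  # True
```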
