You can define a config file similar to this (borrowing from the Scrapy tutorial):
{
    "region": [["xpath", "//ul/li"]],
    "fields": {
        "title": [["xpath", "a/text()"]],
        "link": [["xpath", "./a"], ["css", "::attr(href)"]],
        "desc": [["xpath", "text()"]]
    }
}
Here, each list represents the consecutive selector expressions to apply, each one either a CSS selector or an XPath expression.
For example,
[["xpath", "//ul/li"]] means "apply the XPath expression //ul/li", i.e. sel.xpath('//ul/li')
[["xpath", "./a"], ["css", "::attr(href)"]] means "apply the XPath expression ./a, then the CSS selector ::attr(href)" (note: ::attr() is non-standard, it's a Scrapy extension), equivalent to sel.xpath('./a').css('::attr(href)')
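As a quick sanity check (a standalone sketch using only the standard library), the JSON config above parses into plain Python lists of (selector-type, expression) pairs, which is exactly what the spider will iterate over:

```python
import json

config_json = """
{
    "region": [["xpath", "//ul/li"]],
    "fields": {
        "title": [["xpath", "a/text()"]],
        "link": [["xpath", "./a"], ["css", "::attr(href)"]],
        "desc": [["xpath", "text()"]]
    }
}
"""
config = json.loads(config_json)

# Each field maps to a list of (selector-type, expression) pairs
for expr_type, expr_val in config["fields"]["link"]:
    print(expr_type, expr_val)
# xpath ./a
# css ::attr(href)
```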
I added a "region" config key to apply the field selectors within a specific region of the page.
You can pass a JSON string as an argument to your spider (-a argname=argvalue), and your argument is available as an attribute of your spider -- self.selconfig in my case.
Spider code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field
import json
import pprint


def apply_exp_list(selector, expression_list):
    out = selector
    for expr_type, expr_val in expression_list:
        if expr_type == "xpath":
            out = out.xpath(expr_val)
        elif expr_type == "css":
            out = out.css(expr_val)
    return out


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        config = json.loads(self.selconfig)
        self.log("selector configuration: \n%s" % pprint.pformat(config))

        regions = apply_exp_list(sel, config["region"])
        items = []
        for region in regions:
            item = DmozItem()
            for field_name, exp_list in config["fields"].items():
                item[field_name] = apply_exp_list(region, exp_list).extract()
            items.append(item)
        return items
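The chaining in apply_exp_list can be exercised without Scrapy at all. Below is a minimal sketch using a stand-in FakeSelector class (hypothetical, for illustration only) that records the chained .xpath()/.css() calls, so you can verify the expressions are applied in order:

```python
# FakeSelector mimics the chaining interface of a Scrapy Selector:
# each call returns a new object carrying the accumulated call list.
class FakeSelector:
    def __init__(self, calls=None):
        self.calls = calls or []

    def xpath(self, expr):
        return FakeSelector(self.calls + [("xpath", expr)])

    def css(self, expr):
        return FakeSelector(self.calls + [("css", expr)])


def apply_exp_list(selector, expression_list):
    out = selector
    for expr_type, expr_val in expression_list:
        if expr_type == "xpath":
            out = out.xpath(expr_val)
        elif expr_type == "css":
            out = out.css(expr_val)
    return out


result = apply_exp_list(FakeSelector(),
                        [["xpath", "./a"], ["css", "::attr(href)"]])
print(result.calls)
# [('xpath', './a'), ('css', '::attr(href)')]
```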
And on the command line, for example:
paul@wheezy:~/tmp/stackoverflow$ scrapy runspider 21474657.py \
-a selconfig='{"fields": {"desc": [["xpath", "text()"]], "link": [["xpath", "./a"], ["css", "::attr(href)"]], "title": [["xpath", "a/text()"]]}, "region": [["xpath", "//ul/li"]]}'
Note: having played around with this a bit, it's maybe more complex than expected; it reminded me of (disclaimer: my own) parslepy project.