You can define a config file similar to this (borrowing from the Scrapy tutorial):
{
    "region": [["xpath", "//ul/li"]],
    "fields": {
        "title": [["xpath", "a/text()"]],
        "link": [["xpath", "./a"], ["css", "::attr(href)"]],
        "desc": [["xpath", "text()"]]
    }
}
Here, each list represents the consecutive selector expressions to apply, each one either a CSS selector or an XPath expression.
For example,
[["xpath", "//ul/li"]] means "apply the XPath expression //ul/li", i.e. sel.xpath('//ul/li')
[["xpath", "./a"], ["css", "::attr(href)"]] means "apply the XPath expression ./a, then the CSS selector ::attr(href)" (note: ::attr() is non-standard, it's a Scrapy extension), equivalent to sel.xpath('./a').css('::attr(href)')
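As a quick sanity check (a standalone sketch using only the standard library), the JSON config above parses into plain Python lists of (selector-type, expression) pairs, which is exactly what the spider will iterate over:

```python
import json

config_json = """
{
    "region": [["xpath", "//ul/li"]],
    "fields": {
        "title": [["xpath", "a/text()"]],
        "link": [["xpath", "./a"], ["css", "::attr(href)"]],
        "desc": [["xpath", "text()"]]
    }
}
"""
config = json.loads(config_json)

# Each field maps to a list of (selector-type, expression) pairs
for expr_type, expr_val in config["fields"]["link"]:
    print(expr_type, expr_val)
# xpath ./a
# css ::attr(href)
```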
I added a "region" config key to apply the field selectors within a specific region of the page.
You can pass a JSON string as an argument to your spider (-a argname=argvalue), and your argument is available as an attribute of your spider -- self.selconfig in my case.
Spider code:
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.item import Item, Field
import json
import pprint


def apply_exp_list(selector, expression_list):
    out = selector
    for expr_type, expr_val in expression_list:
        if expr_type == "xpath":
            out = out.xpath(expr_val)
        elif expr_type == "css":
            out = out.css(expr_val)
    return out


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        sel = Selector(response)
        config = json.loads(self.selconfig)
        self.log("selector configuration: \n%s" % pprint.pformat(config))

        regions = apply_exp_list(sel, config["region"])
        items = []
        for region in regions:
            item = DmozItem()
            for field_name, exp_list in config["fields"].items():
                item[field_name] = apply_exp_list(region, exp_list).extract()
            items.append(item)
        return items
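The chaining in apply_exp_list can be exercised without Scrapy at all. Below is a minimal sketch using a stand-in FakeSelector class (hypothetical, for illustration only) that records the chained .xpath()/.css() calls, so you can verify the expressions are applied in order:

```python
# FakeSelector mimics the chaining interface of a Scrapy Selector:
# each call returns a new object carrying the accumulated call list.
class FakeSelector:
    def __init__(self, calls=None):
        self.calls = calls or []

    def xpath(self, expr):
        return FakeSelector(self.calls + [("xpath", expr)])

    def css(self, expr):
        return FakeSelector(self.calls + [("css", expr)])


def apply_exp_list(selector, expression_list):
    out = selector
    for expr_type, expr_val in expression_list:
        if expr_type == "xpath":
            out = out.xpath(expr_val)
        elif expr_type == "css":
            out = out.css(expr_val)
    return out


result = apply_exp_list(FakeSelector(),
                        [["xpath", "./a"], ["css", "::attr(href)"]])
print(result.calls)
# [('xpath', './a'), ('css', '::attr(href)')]
```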
And on the command line, for example:
paul@wheezy:~/tmp/stackoverflow$ scrapy runspider 21474657.py \
-a selconfig='{"fields": {"desc": [["xpath", "text()"]], "link": [["xpath", "./a"], ["css", "::attr(href)"]], "title": [["xpath", "a/text()"]]}, "region": [["xpath", "//ul/li"]]}'
Note: having played around with this a bit, it's maybe more complex than expected; it reminded me of (disclaimer: my own) parslepy project.