
I can run a crawl from a Python script with the following recipe from the wiki:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

As you can see, I can just pass the domain to FollowAllSpider, but my question is: how can I pass the start_urls (actually a date that will be appended to a fixed URL) to my spider class using the above code?

This is my spider class:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector

from poptop.items import PoptopItem


class MySpider(CrawlSpider):
    name = 'tw'

    def __init__(self, date):
        y, m, d = date.split('-')  # this is a test, it could split with a regex!
        try:
            y, m, d = int(y), int(m), int(d)
        except ValueError:
            raise ValueError('Enter a valid date')

        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y, m, d)]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href') 
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']

And this is my script:

from pdfcreator import convertor
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
#from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from poptop.items import PoptopItem

settings = get_project_settings()
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()

date = raw_input('Enter the date with this format (d-m-Y) : ')
print date
spider = MySpider(date=date)
crawler.crawl(spider)
crawler.start()
log.start()
item=PoptopItem()

for url in item['url']:
    convertor(url)

reactor.run() # the script will block here until the spider_closed signal was sent

If I just print the item, I'll get the following error:

2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17>

My items.py:

import scrapy


class PoptopItem(scrapy.Item):
    titles = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()

1 Answer


You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

from datetime import datetime

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs) 

        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')

        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

Then, you would instantiate the spider this way:

spider = MySpider(date='01-01-2015')
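When running through the scrapy crawl command instead of a script, the same keyword argument is passed with the -a flag (standard Scrapy behavior, shown here for completeness):

scrapy crawl tw -a date=01-01-2015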

Or, you can avoid parsing the date at all by passing a datetime instance in the first place:

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs) 

        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')

        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

spider = MySpider(dt=datetime(year=2014, month=1, day=1))

And, just FYI, see this answer for a detailed example of how to run Scrapy from a script.
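Regarding the error quoted in the question: parse() must yield Item or Request objects, so yield the item itself rather than item['url']. If the script then needs the scraped URLs, one option is to collect items through the item_scraped signal. A minimal sketch, assuming the Crawler setup and convertor from the question's script:

items = []

def collect_item(item, response, spider):
    # called once for every item the spider yields
    items.append(item)

crawler.signals.connect(collect_item, signal=signals.item_scraped)
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()  # blocks until the spider_closed signal stops the reactor

# the crawl has finished; now it is safe to read the collected items
for item in items:
    convertor(item['url'])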


Comments

Thanks a lot for the explanation! As I said, the date parser is just a test. And thanks for the link suggestion. Now, as you can see, my parse function yields the URL; how can I get it (after running the crawl)?
I used items, but for url in item['url']: raised a KeyError; it seems the crawl doesn't run!
@KasraAD I think you just need to yield item instead of yield item['url']. Let me know if it helped or not.
In my spider I just yield item, and I get that error again! I will edit the question and add my script; hope that helps!
@KasraAD two things: 1. Why are you instantiating an item inside the script where you run the crawl? I think you don't need it there; if you want to post-process an item returned from the spider, do it in a pipeline. 2. Can you also show the PoptopItem class definition? Thanks.
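Following up on the pipeline suggestion in the last comment, a minimal sketch; the class name ConvertorPipeline and the module path are illustrative, and convertor is the question's own helper:

# poptop/pipelines.py (illustrative location)
from pdfcreator import convertor

class ConvertorPipeline(object):
    def process_item(self, item, spider):
        # runs for every scraped item; must return the item (or raise DropItem)
        convertor(item['url'])
        return item

And enable it in the project's settings.py:

ITEM_PIPELINES = {'poptop.pipelines.ConvertorPipeline': 300}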
