I'm attempting to run Scrapy within a Python script. Here's the relevant code:

import json

import scrapy
from scrapy.crawler import CrawlerProcess

class PostSpider(scrapy.Spider):
    name = "post crawler"
    allowed_domains = ['test.com']

    def __init__(self, **kwargs):
        super(PostSpider, self).__init__(**kwargs)

        url = kwargs.get('url')
        print(url)
        self.start_urls = ['https://www.test.com/wp-json/test/2.0/posts' + url]

    def parse(self, response):
        post = json.loads(response.body_as_unicode())
        post = post["content"]
        return post

posts = GA.retrieve(TIA.start_date, TIA.end_date, "content type auto")

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
}) 

for post in posts:
    post_url = post[2]
    process.crawl(PostSpider(url=post_url))
    process.start()

I'm attempting to follow the guidelines here and here, but I can't get it to work. Here's the error message I received:

Unhandled error in Deferred:
2016-03-25 20:49:43 [twisted] CRITICAL: Unhandled error in Deferred:


Traceback (most recent call last):
  File "text_analysis.py", line 48, in <module>
    process.crawl(PostSpider(url=post_url))
  File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 163, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 167, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "/Users/terence/TIA/lib/python3.5/site-packages/twisted/internet/defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "/Users/terence/TIA/lib/python3.5/site-packages/twisted/internet/defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 71, in crawl
    self.spider = self._create_spider(*args, **kwargs)
  File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/crawler.py", line 94, in _create_spider
    return self.spidercls.from_crawler(self, *args, **kwargs)
  File "/Users/terence/TIA/lib/python3.5/site-packages/scrapy/spiders/__init__.py", line 50, in from_crawler
    spider = cls(*args, **kwargs)
  File "text_analysis.py", line 17, in __init__
    self.start_urls = ['https://www.techinasia.com/wp-json/techinasia/2.0/posts' + url]
builtins.TypeError: Can't convert 'NoneType' object to str implicitly
2016-03-25 20:49:43 [twisted] CRITICAL: 
/xiaomi-still-got-it-bitches
2016-03-25 20:49:43 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.logstats.LogStats']

I can't seem to figure out what's wrong.

2 Answers

The call to process.crawl() must be:

process.crawl(PostSpider, url=post_url)

because the method's signature is:

crawl(crawler_or_spidercls, *args, **kwargs)

It expects the spider class (not an instantiated object) as the first argument. All subsequent positional and keyword arguments (*args, **kwargs) are passed through to the spider's __init__.


CrawlerProcess.crawl expects a Spider class, not a Spider instance.

You should pass arguments like this:

process.crawl(PostSpider, url=post_url)
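To see why the instance form fails, here is a Scrapy-free sketch of what crawl() does internally (the class mirrors the question's spider; crawler is a stand-in argument): the spider is always re-instantiated through from_crawler, so only the kwargs given to crawl() itself reach __init__. An instance built beforehand is discarded along with its url, leaving url as None.

```python
class PostSpider:
    """Stand-in for the question's spider; no Scrapy required."""

    def __init__(self, **kwargs):
        url = kwargs.get('url')  # None unless crawl() forwarded a url kwarg
        self.start_urls = ['https://www.test.com/wp-json/test/2.0/posts' + url]

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Scrapy builds a fresh spider here; only the arguments passed
        # to process.crawl() are forwarded on to __init__.
        return cls(*args, **kwargs)


# crawl(PostSpider, url=...) forwards the kwarg, so __init__ receives it:
spider = PostSpider.from_crawler(None, url='/xiaomi-post')
print(spider.start_urls[0])

# crawl(PostSpider(url=...)) re-instantiates with *no* kwargs, so url is
# None and the concatenation raises the TypeError seen in the traceback:
try:
    PostSpider.from_crawler(None)
except TypeError as exc:
    print('TypeError:', exc)
```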
