
I have written a sample scraper (actually, I adapted a scraper from the tutorial):

from scrapy.spider import Spider
from scrapy.selector import Selector
from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["cryptocoincharts.info"]
    start_urls = [
        "http://www.cryptocoincharts.info/v2/coins/show/drk",
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//table[@class="table table-striped"]//tr[7]/td[2]')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('text()').re('[^\t\n]+')
            items.append(item)
        return items

And I got a processing error; here is the log:

scrapy crawl dmoz -o items.json -t json

2014-05-21 22:26:54+0200 [scrapy] INFO: Scrapy 0.23.0-231-g2bf09b8 started (bot: scrapybot)
2014-05-21 22:26:54+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-21 22:26:54+0200 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['dirbot.spiders'], 'FEED_URI': 'items.json', 'NEWSPIDER_MODULE': 'dirbot.spiders'}
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-21 22:26:54+0200 [scrapy] INFO: Enabled item pipelines: FilterWordsPipeline
2014-05-21 22:26:54+0200 [dmoz] INFO: Spider opened
2014-05-21 22:26:54+0200 [dmoz] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-21 22:26:54+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-21 22:26:54+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-21 22:26:54+0200 [dmoz] DEBUG: Crawled (200) <GET http://www.cryptocoincharts.info/v2/coins/show/drk> (referer: None)
2014-05-21 22:26:54+0200 [dmoz] ERROR: Error processing {'name': [u'0.0160990 BTC',
              u'7.9770495 USD',
              u'5.7816480 EUR',
              u'48.829847 CNY',
              u'4.7026302 GBP',
              u'6.9809075 CHF',
              u'8.6828030 CAD',
              u'811.85225 JPY',
              u'8.5037582 AUD',
              u'83.350117 ZAR',
              u'0.00595524\xa0oz GOLD (= 0.17\xa0grams)',
              u'0.37805922\xa0oz SILVER (= 10.72\xa0grams)']}
    Traceback (most recent call last):
      File "/usr/lib/pymodules/python2.7/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
        self._startRunCallbacks(result)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item
        if word in unicode(item['description']).lower():
      File "/usr/lib/pymodules/python2.7/scrapy/item.py", line 50, in __getitem__
        return self._values[key]
    exceptions.KeyError: 'description'

2014-05-21 22:26:54+0200 [dmoz] ERROR: Error processing {'name': []}
    Traceback (most recent call last):
      File "/usr/lib/pymodules/python2.7/scrapy/middleware.py", line 62, in _process_chain
        return process_chain(self.methods[methodname], obj, *args)
      File "/usr/lib/pymodules/python2.7/scrapy/utils/defer.py", line 65, in process_chain
        d.callback(input)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 362, in callback
        self._startRunCallbacks(result)
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 458, in _startRunCallbacks
        self._runCallbacks()
    --- <exception caught here> ---
      File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 545, in _runCallbacks
        current.result = callback(current.result, *args, **kw)
      File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item
        if word in unicode(item['description']).lower():
      File "/usr/lib/pymodules/python2.7/scrapy/item.py", line 50, in __getitem__
        return self._values[key]
    exceptions.KeyError: 'description'

2014-05-21 22:26:54+0200 [dmoz] INFO: Closing spider (finished)
2014-05-21 22:26:54+0200 [dmoz] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 254,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 4986,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 5, 21, 20, 26, 54, 390997),
     'log_count/DEBUG': 3,
     'log_count/ERROR': 2,
     'log_count/INFO': 7,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/memory': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/memory': 1,
     'start_time': datetime.datetime(2014, 5, 21, 20, 26, 54, 211942)}
2014-05-21 22:26:54+0200 [dmoz] INFO: Spider closed (finished)

I have tried to find out what is going on, but unfortunately I cannot find any reason why the items are not being exported to the JSON file. In earlier projects, Scrapy exported multi-row data to JSON without any issues.

1 Answer


Take a closer look at the traceback; there is this line:

File "/home/me/Desktop/scrapy/dirbot-master/dirbot-master/dirbot/pipelines.py", line 13, in process_item
    if word in unicode(item['description']).lower():

This means that your pipeline is throwing the error while trying to process an item.
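For reference, the FilterWordsPipeline that ships with dirbot looks roughly like the sketch below (reconstructed from the traceback; the exact words_to_filter list is an assumption, so check your own pipelines.py):

from scrapy.exceptions import DropItem

class FilterWordsPipeline(object):
    """Drops items whose description contains a forbidden word."""
    words_to_filter = ['politics', 'religion']  # assumed; check pipelines.py

    def process_item(self, item, spider):
        for word in self.words_to_filter:
            # line 13 from the traceback: assumes 'description' is always set
            if word in unicode(item['description']).lower():
                raise DropItem("Contains forbidden word: %s" % word)
        return item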

Then, look at which fields you fill in the spider:

for site in sites:
    item = Website()
    item['name'] = site.xpath('text()').re('[^\t\n]+')
    items.append(item)

As you can see, no description field is set, and that is the reason for the error.
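There are two straightforward fixes. The first is to populate the field in the spider, for example (a minimal sketch; the empty string is just a placeholder):

for site in sites:
    item = Website()
    item['name'] = site.xpath('text()').re('[^\t\n]+')
    item['description'] = ''  # give the pipeline something to inspect
    items.append(item)

The second is to make the pipeline tolerate items without a description. Scrapy items support dict-style access, so item.get() with a default works:

def process_item(self, item, spider):
    # fall back to an empty string when 'description' was never set
    description = item.get('description', '')
    for word in self.words_to_filter:
        if word in unicode(description).lower():
            raise DropItem("Contains forbidden word: %s" % word)
    return item

Either way, the KeyError disappears and the scraped items get written to items.json.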
