
I'm trying to pull a file from a password-protected FTP server. This is the code I'm using:

import scrapy
from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.http import Request
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['ftp.site.co.uk']
    itertag = 'item'

    def start_requests(self):
        yield Request('ftp.site.co.uk/feed.xml',
            meta={'ftp_user': 'test', 'ftp_password': 'test'})

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['title'] = (selector.xpath('//title/text()').extract() or [''])[0]      
        return item

This is the traceback error I get:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run
        self.mainLoop()
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 112, in _next_request
        request = next(slot.start_requests)
      File "/var/www/spider/crawler/spiders/site.py", line 13, in start_requests
        meta={'ftp_user': 'test', 'ftp_password': 'test'})
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
        self._set_url(url)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    exceptions.ValueError: Missing scheme in request url: ftp.site.co.uk/feed.xml

1 Answer


You need to add a scheme to the URL:

ftp://ftp.site.co.uk

The FTP URL syntax is defined as:

ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path>

Basically, you do this:

yield Request('ftp://ftp.site.co.uk/feed.xml', ...)

Read more about URI schemes on Wikipedia: http://en.wikipedia.org/wiki/URI_scheme
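Putting it together, the start_requests from the question only needs the scheme added. A minimal sketch, reusing the placeholder host and credentials from the question:

def start_requests(self):
    # The ftp:// scheme is what was missing; Scrapy picks its FTP download
    # handler based on it. The login credentials still go in the request meta.
    yield Request('ftp://ftp.site.co.uk/feed.xml',
        meta={'ftp_user': 'test', 'ftp_password': 'test'})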


6 Comments

Thank you for the reply. I've been unable to find any documentation or examples for scrapy on adding a scheme though.
Well, this is not Scrapy-specific; it's web-specific. You add a scheme by prepending it to the URL, like the http:// you have for HTTP URIs.
Lawrence, I really appreciate the help. I have the yield Request in my original post though; is that not correct? def start_requests(self): yield Request('ftp.site.co.uk/feed.xml', meta={'ftp_user': 'test', 'ftp_password': 'test'})
It is not correct. You are missing the ftp:// part. Don't let the ftp in ftp.site.co.uk confuse you, that is not the same. You need it to be ftp://ftp.site.co.uk
Is there any way to catch this error? I tried handling it in downloader as well as spider middlewares, but it seems it's being thrown before the middlewares run.
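The ValueError is raised inside Request.__init__, i.e. while start_requests is still constructing the request, so neither downloader nor spider middlewares ever see it. One option is to catch it at construction time. A minimal sketch, assuming you simply want to log and skip malformed URLs (the urls list here is a placeholder):

def start_requests(self):
    urls = ['ftp.site.co.uk/feed.xml']  # placeholder; possibly unvalidated input
    for url in urls:
        try:
            # Request() validates the URL and raises ValueError if the scheme is missing
            request = Request(url,
                meta={'ftp_user': 'test', 'ftp_password': 'test'})
        except ValueError:
            # Malformed URL: log it and move on to the next one
            self.log('Skipping malformed URL: %s' % url)
            continue
        yield request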
