
I'm trying to pull a file from a password-protected FTP server. This is the code I'm using:

import scrapy
from scrapy.contrib.spiders import XMLFeedSpider
from scrapy.http import Request
from crawler.items import CrawlerItem

class SiteSpider(XMLFeedSpider):
    name = 'site'
    allowed_domains = ['ftp.site.co.uk']
    itertag = 'item'

    def start_requests(self):
        yield Request('ftp.site.co.uk/feed.xml',
            meta={'ftp_user': 'test', 'ftp_password': 'test'})

    def parse_node(self, response, selector):
        item = CrawlerItem()
        item['title'] = (selector.xpath('//title/text()').extract() or [''])[0]      
        return item

This is the traceback error I get:

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1192, in run
        self.mainLoop()
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1201, in mainLoop
        self.runUntilCurrent()
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/reactor.py", line 41, in __call__
        return self._func(*self._a, **self._kw)
    --- <exception caught here> ---
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 112, in _next_request
        request = next(slot.start_requests)
      File "/var/www/spider/crawler/spiders/site.py", line 13, in start_requests
        meta={'ftp_user': 'test', 'ftp_password': 'test'})
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
        self._set_url(url)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
        raise ValueError('Missing scheme in request url: %s' % self._url)
    exceptions.ValueError: Missing scheme in request url: ftp.site.co.uk/feed.xml

1 Answer


You need to add a scheme to the URL:

ftp://ftp.site.co.uk

The FTP URL syntax is defined as:

ftp://[<user>[:<password>]@]<host>[:<port>]/<url-path>

Basically, you do this:

yield Request('ftp://ftp.site.co.uk/feed.xml', ...)

Read more about URI schemes on Wikipedia: http://en.wikipedia.org/wiki/URI_scheme
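Putting it together, the start_requests from the question only needs the scheme added. A minimal sketch, reusing the placeholder host and credentials from the question:

def start_requests(self):
    # The ftp:// scheme is what was missing; Scrapy picks its FTP download
    # handler based on it. The login credentials still go in the request meta.
    yield Request('ftp://ftp.site.co.uk/feed.xml',
        meta={'ftp_user': 'test', 'ftp_password': 'test'})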


6 Comments

Thank you for the reply. I've been unable to find any documentation or examples for scrapy on adding a scheme though.
Well, this is not Scrapy-specific; it's web-specific. You add a scheme by prepending it to the URL, like the http:// you have for HTTP URIs.
Lawrence, I really appreciate the help. I have the yield Request in my original post though; is that not correct? def start_requests(self): yield Request('ftp.site.co.uk/feed.xml', meta={'ftp_user': 'test', 'ftp_password': 'test'})
It is not correct. You are missing the ftp:// part. Don't let the ftp in ftp.site.co.uk confuse you, that is not the same. You need it to be ftp://ftp.site.co.uk
Is there any way to catch this error? I tried handling it in downloader as well as spider middlewares, but it seems it's being thrown before the middlewares run.
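The ValueError is raised inside Request.__init__, i.e. while start_requests is still constructing the request, so neither downloader nor spider middlewares ever see it. One option is to catch it at construction time. A minimal sketch, assuming you simply want to log and skip malformed URLs (the urls list here is a placeholder):

def start_requests(self):
    urls = ['ftp.site.co.uk/feed.xml']  # placeholder; possibly unvalidated input
    for url in urls:
        try:
            # Request() validates the URL and raises ValueError if the scheme is missing
            request = Request(url,
                meta={'ftp_user': 'test', 'ftp_password': 'test'})
        except ValueError:
            # Malformed URL: log it and move on to the next one
            self.log('Skipping malformed URL: %s' % url)
            continue
        yield request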
