
I'm writing a very simple web crawler and trying to parse 'robots.txt' files. I found the robotparser module in the standard library, which should do exactly this. I'm using Python 2.7.2. Unfortunately, my code won't load the 'robots.txt' files correctly, and I can't figure out why.

Here is the relevant snippet of my code:

from urlparse import urlparse, urljoin
import robotparser

def get_all_links(page, url):
    links = []
    # Build the site root (scheme://host) and point the parser at its robots.txt.
    page_url = urlparse(url)
    base = page_url[0] + '://' + page_url[1]
    robots_url = urljoin(base, '/robots.txt')
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    # Collect every link on the page that robots.txt allows us to fetch.
    for link in page.find_all('a'):
        link_url = link.get('href')
        print "Found a link: ", link_url
        if not rp.can_fetch('*', link_url):
            print "Page off limits!"
            continue
        links.append(link_url)
    return links

Here page is a parsed BeautifulSoup object and url is a URL stored as a string. The parser reads in a blank 'robots.txt' file, instead of the one at the specified URL, and returns True to all can_fetch() queries. It looks like it's either not opening the URL or failing to read the text file.
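
To narrow down which of those two it is, one check that skips rp.read() entirely is to download the file by hand with urllib2 and pass the text to rp.parse(). (The User-Agent string and the test path below are just placeholders.)

import urllib2
import robotparser

robots_url = 'http://www.udacity-forums.com/robots.txt'

# Download robots.txt ourselves so we can see exactly what the server sends back.
request = urllib2.Request(robots_url, headers={'User-Agent': 'MyCrawler/0.1'})  # placeholder UA
raw = urllib2.urlopen(request).read()
print raw  # should be the real robots.txt text, not an empty string

# Hand the downloaded text straight to the parser, bypassing rp.read().
rp = robotparser.RobotFileParser()
rp.parse(raw.splitlines())
print rp.entries
print rp.can_fetch('*', 'http://www.udacity-forums.com/some/page')  # example path only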

I've tried it in the interactive interpreter, too. This is what happens, using the same syntax as the documentation page.

Python 2.7.2 (default, Aug 18 2011, 18:04:39) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> url = 'http://www.udacity-forums.com/robots.txt'
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url(url)
>>> rp.read()
>>> print rp

>>> 

The line print rp should print the contents of the 'robots.txt' file, but it prints nothing. Even more frustrating, these examples both work perfectly fine as written, but fail when I try my own URL. I'm pretty new to Python, and I can't figure out what's going wrong. As far as I can tell, I'm using the module in the same way as the documentation and examples. Thanks for any help!

UPDATE 1: Here are a few more lines from the interpreter, in case print rp was not a good method to check if 'robots.txt' was read in. The path, host, and url attributes are correct, but the entries from 'robots.txt' have still not been read in.

>>> rp
<robotparser.RobotFileParser instance at 0x1004debd8>
>>> dir(rp)
['__doc__', '__init__', '__module__', '__str__', '_add_entry', 'allow_all', 'can_fetch', 'default_entry', 'disallow_all', 'entries', 'errcode', 'host', 'last_checked', 'modified', 'mtime', 'parse', 'path', 'read', 'set_url', 'url']
>>> rp.path
'/robots.txt'
>>> rp.host
'www.udacity-forums.com'
>>> rp.entries
[]
>>> rp.url
'http://www.udacity-forums.com/robots.txt'
>>> 
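
Another attribute from that dir() listing that might be worth checking is errcode, which (if I'm reading the robotparser source right) records the HTTP status of the last fetch:

>>> rp.read()
>>> rp.errcode  # 200 would mean the download itself worked; a 4xx/5xx would explain the empty entries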

UPDATE 2: I have solved this problem by using this external library to parse 'robots.txt' files. (But I haven't answered the original question!) After spending some more time in the terminal, my best guess is that robotparser cannot handle certain additions to the 'robots.txt' spec, like Sitemap, and has trouble with blank lines. It will read in files from, e.g. Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, which include Sitemap statements and blank lines. I'd still appreciate it if someone smarter than me could confirm or explain this!
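
If that guess is right, it should be reproducible without any network access at all: feeding parse() a small, made-up robots.txt body that has a Sitemap line and blank lines should leave entries empty. (The example.com content below is invented just for this test.)

import robotparser

# Invented robots.txt text mimicking the files that fail: a Sitemap line plus blank lines.
sample = """Sitemap: http://www.example.com/sitemap.xml

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(sample.splitlines())  # parsing only, no fetching involved
print rp.entries  # still [] would point at parse(); a non-empty list would point at read()
print rp.can_fetch('*', 'http://www.example.com/private/page')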

  • By the way, you can see this snippet in context here, in case I left out something relevant. Commented Apr 5, 2012 at 10:18
  • The line print rp should print the contents of the 'robots.txt' file - are you sure about that? Commented Apr 5, 2012 at 10:40
  • Pretty sure. When I used the external examples I linked, this is how it behaved. Just in case, I updated my question with some more information from the interpreter. The URL attributes all look right, but entries is an empty list. Commented Apr 5, 2012 at 10:55
  • I am having the same problem. I tried parsing google.com/robots.txt using the lib you mentioned (nikitathespider.com/python/rerp), and when I try can_fetch("*", "/catalogs/p?") it returns False, even though it's allowed. I am in doubt here. Any clue about this? Commented Mar 11, 2013 at 18:03

2 Answers


I have solved this problem by using this external library to parse 'robots.txt' files. (But I haven't answered the original question!) After spending some more time in the terminal, my best guess is that robotparser cannot handle certain additions to the 'robots.txt' spec, like Sitemap, and has trouble with blank lines. It will read in files from, e.g. Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, which include Sitemap statements and blank lines. I'd still appreciate it if someone smarter than me could confirm or explain this!


A solution could be to use the reppy module:

pip install reppy

Here are a few examples:

In [1]: import reppy

In [2]: x = reppy.fetch("http://google.com/robots.txt")

In [3]: x.atts
Out[3]: 
{'agents': {'*': <reppy.agent at 0x1fd9610>},
 'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
  'http://www.google.com/hostednews/sitemap_index.xml',
  'http://www.google.com/sitemaps_webmasters.xml',
  'http://www.google.com/ventures/sitemap_ventures.xml',
  'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
  'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
  'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
  'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}

In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True

In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False

In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True

In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?

In [9]: x.ttl
Out[9]: 3721.3556718826294

Voila!
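
If you want to drop this into the question's get_all_links(), a rough sketch could look like the following. (It assumes the same reppy API as the session above, and that the hrefs on the page are site-relative paths like the ones passed to allowed() there.)

from urlparse import urlparse, urljoin
import reppy

def get_all_links(page, url):
    links = []
    parts = urlparse(url)
    base = parts.scheme + '://' + parts.netloc
    # Fetch and parse robots.txt once per page, as in the question's snippet.
    robots = reppy.fetch(urljoin(base, '/robots.txt'))
    for link in page.find_all('a'):
        href = link.get('href')
        if robots.allowed(href, 'My_crawler'):  # same allowed() call as in the session above
            links.append(href)
        else:
            print "Page off limits: ", href
    return links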
