I'm writing a very simple web crawler and trying to parse 'robots.txt' files. I found the robotparser module in the standard library, which should do exactly this. I'm using Python 2.7.2. Unfortunately, my code won't load the 'robots.txt' files correctly, and I can't figure out why.
Here is the relevant snippet of my code:
from urlparse import urlparse, urljoin
import robotparser

def get_all_links(page, url):
    links = []
    page_url = urlparse(url)
    base = page_url[0] + '://' + page_url[1]
    robots_url = urljoin(base, '/robots.txt')
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    for link in page.find_all('a'):
        link_url = link.get('href')
        print "Found a link: ", link_url
        if not rp.can_fetch('*', link_url):
            print "Page off limits!"
            pass
Here page is a parsed BeautifulSoup object and url is a URL stored as a string. The parser reads in a blank 'robots.txt' file, instead of the one at the specified URL, and returns True to all can_fetch() queries. It looks like it's either not opening the URL or failing to read the text file.
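To narrow that down, one diagnostic I could run (not part of my crawler code, just a sketch using the Udacity URL I test with below) is to fetch the file directly with urllib2 and look at the raw response:

import urllib2

# Diagnostic only: fetch robots.txt by hand, bypassing robotparser,
# to see what the server actually sends back.
response = urllib2.urlopen('http://www.udacity-forums.com/robots.txt')
print response.getcode()   # HTTP status code
print response.read()      # raw file contents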
I've tried it in the interactive interpreter, too. This is what happens, using the same syntax as the documentation page.
Python 2.7.2 (default, Aug 18 2011, 18:04:39)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import robotparser
>>> url = 'http://www.udacity-forums.com/robots.txt'
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url(url)
>>> rp.read()
>>> print rp
>>>
The print rp line should print the contents of the 'robots.txt' file, but it prints nothing. Even more frustratingly, these examples both work perfectly fine as written, but fail when I try my own URL. I'm pretty new to Python, and I can't figure out what's going wrong. As far as I can tell, I'm using the module the same way as the documentation and examples. Thanks for any help!
UPDATE 1: Here are a few more lines from the interpreter, in case print rp was not a good method to check if 'robots.txt' was read in. The path, host, and url attributes are correct, but the entries from 'robots.txt' have still not been read in.
>>> rp
<robotparser.RobotFileParser instance at 0x1004debd8>
>>> dir(rp)
['__doc__', '__init__', '__module__', '__str__', '_add_entry', 'allow_all', 'can_fetch', 'default_entry', 'disallow_all', 'entries', 'errcode', 'host', 'last_checked', 'modified', 'mtime', 'parse', 'path', 'read', 'set_url', 'url']
>>> rp.path
'/robots.txt'
>>> rp.host
'www.udacity-forums.com'
>>> rp.entries
[]
>>> rp.url
'http://www.udacity-forums.com/robots.txt'
>>>
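For what it's worth, one workaround I've been considering (just a sketch, I haven't confirmed it behaves any differently) is to fetch the file myself and hand its lines to parse() instead of calling read():

import urllib2
import robotparser

url = 'http://www.udacity-forums.com/robots.txt'
rp = robotparser.RobotFileParser()
rp.set_url(url)
# Fetch the file by hand and feed its lines to parse() instead of read()
raw = urllib2.urlopen(url).read()
rp.parse(raw.splitlines())
print rp.entries
print rp.can_fetch('*', 'http://www.udacity-forums.com/')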
UPDATE 2: I have solved this problem by using this external library to parse 'robots.txt' files. (But I haven't answered the original question!) After spending some more time in the terminal, my best guess is that robotparser cannot handle certain additions to the 'robots.txt' spec, like Sitemap, and has trouble with blank lines. It will read in files from, e.g., Stack Overflow and Python.org, but not Google, YouTube, or my original Udacity file, all of which include Sitemap statements and blank lines. I'd still appreciate it if someone smarter than me could confirm or explain this!
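If anyone wants to check that guess, this is the kind of minimal test I have in mind; the robots.txt content here is made up by me, not the real Udacity file:

import robotparser

# Made-up robots.txt lines with a blank line and a Sitemap directive,
# roughly the shape of the files that seemed to fail for me.
lines = [
    'User-agent: *',
    'Disallow: /private/',
    '',
    'Sitemap: http://www.example.com/sitemap.xml',
]
rp = robotparser.RobotFileParser()
rp.parse(lines)
print rp.entries   # does anything get parsed at all?
print rp.can_fetch('*', 'http://www.example.com/private/page.html')   # False if the Disallow rule is honoured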