
I am facing an issue with Python's robotparser module. It works fine for a particular URL but starts failing once I perform a specific sequence of steps. Below are the steps I performed and their outcomes:

This sequence works fine:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
True
>>> 

However, the sequence below fails even though it performs the same steps as above:

>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://menendez.senate.gov/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://menendez.senate.gov/contact/contact.cfm")
False
>>>
>>>
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
False
>>>

After debugging it for some time, I found that it works fine if I create a new object every time I use a new URL. That is, I have to do "rp = robotparser.RobotFileParser()" every time the URL changes.
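The workaround can be sketched as a small helper that builds a fresh parser per robots.txt instead of reusing one object. This is only a sketch under my own naming, shown offline via parse() (which takes the file's lines directly) so no network access is needed; it uses Python 3's urllib.robotparser, which is the same module as Python 2's plain robotparser:

```python
from urllib import robotparser  # on Python 2: import robotparser

def check(robots_lines, page_url, agent="*"):
    # Build a fresh parser for every robots.txt so no state carries
    # over from a previously read site.
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, page_url)

# Offline demonstration with two made-up sites: the first disallows
# everything, the second disallows nothing.
print(check(["User-agent: *", "Disallow: /"], "http://a.example/x"))
print(check(["User-agent: *", "Disallow:"], "http://b.example/x"))
```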

I am not sure my approach is right: since robotparser lets me change the URL with set_url(), it should be able to handle such cases.

Also, in the above case, I get a 503 error code when I try to download the link "http://menendez.senate.gov/contact/contact.cfm" using requests.get() or any other way. I looked into the code of robotparser.py, and in the read() method of class RobotFileParser there is no check for HTTP response codes of 500 or above. I am not sure why those response codes are not handled; I just wanted some pointers on what the reason for not handling them could be.
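For reference, one can fetch robots.txt oneself so that server errors become visible, and only then hand the lines to parse(). The sketch below is my own wrapper (name and 5xx policy are my choices, not the stdlib's); the 401/403-versus-other-4xx handling is loosely modeled on what newer versions of RobotFileParser.read() do:

```python
from urllib import error, request, robotparser

def read_robots_checked(robots_url):
    # Fetch robots.txt ourselves so server errors (5xx) are visible,
    # then hand the lines to RobotFileParser.parse().
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    try:
        with request.urlopen(robots_url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except error.HTTPError as e:
        if e.code >= 500:
            # 5xx is often temporary (overloading or maintenance),
            # so surface it instead of caching a misleading answer.
            raise
        if e.code in (401, 403):
            rp.disallow_all = True   # access forbidden: disallow everything
        else:
            rp.allow_all = True      # e.g. 404: no robots.txt, allow everything
        return rp
    rp.parse(body.splitlines())
    return rp
```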

  • When I just tried to access menendez.senate.gov/contact/contact.cfm I got an HTTP 404 response and this page menendez.senate.gov/404. Commented Jun 24, 2015 at 23:42
  • I am getting 503 for this url. >>> import requests >>> requests.get("menendez.senate.gov/contact/contact.cfm") <Response [503]> Commented Jun 24, 2015 at 23:49
  • 503 means Service Unavailable and 404 means Not Found. Either way something is wrong with the website. Maybe someone is working on it, and it's not the fault of robotparser.py. If you want it to handle an error code then put it in. A reason for not handling 503 is that it may be "due to a temporary overloading or maintenance of the server", while 404 generally represents a more permanent condition. Commented Jun 25, 2015 at 13:32
  • Looking at the robotparser manual (docs.python.org/2/library/robotparser.html), it should be given a URL that ends in robots.txt and has a structure as documented at robotstxt.org/orig.html. menendez.senate.gov/contact/contact.cfm is not a robots.txt file, so robotparser cannot parse it. Admittedly it should return an error that indicates this instead of an HTTP error; however, the website has problems and robotparser does not appear to be able to read the URL in order to determine that it cannot parse it. Commented Jun 25, 2015 at 13:41
  • Yeah, I am passing the URL to robotparser as menendez.senate.gov/robots.txt, so there I think I am ok. Not handling 5XX errors due to temporary overloading or maintenance makes sense. Commented Jun 25, 2015 at 20:11

1 Answer


robotparser can parse only files in "/robots.txt" format as specified at http://www.robotstxt.org/orig.html and for such files to be active in excluding robot traversals they must be located at /robots.txt on a website. Based on this, robotparser should not be able to parse "http://menendez.senate.gov/contact/contact.cfm" because it is probably not in "/robots.txt" format, even if there were no problems accessing it.

Facebook has a robots.txt file at https://www.facebook.com/robots.txt. It is in plain text and can be read in a browser. robotparser can parse it with no problems, however its access to other files on facebook.com appears to be excluded with the following rule in robots.txt:

User-agent: *
Disallow: /

Here is a session using robotparser to read and parse https://www.facebook.com/robots.txt:

>>> from urllib import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("https://www.facebook.com/robots.txt")
>>> rp.read()  # no error
>>> rp.can_fetch("*", "https://www.facebook.com/")
False
>>> rp.can_fetch("*", "https://www.facebook.com/about/privacy")
False

When testing access to http://www.ontheissues.org/robots.txt in my browser, I got HTTP Error 404 - File or directory not found. Then I downloaded http://svn.python.org/projects/python/branches/release22-maint/Lib/robotparser.py, modified its read() function to print every line it read, ran it on this URL and printed only the first line:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

This line indicates that the format of http://www.ontheissues.org/robots.txt is incorrect for a "/robots.txt" file, although it may redirect to one.
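One way to catch this situation programmatically is to sniff the first non-blank line before trusting the parse. A small sketch (the function name is mine, and the heuristic is deliberately crude):

```python
def looks_like_html(first_line):
    # A real robots.txt is plain text made of "Field: value" lines and
    # "#" comments; an HTML doctype or tag on the first non-blank line
    # strongly suggests the server returned an error page instead.
    line = first_line.lstrip().lower()
    return line.startswith("<!doctype") or line.startswith("<html")

print(looks_like_html('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">'))
print(looks_like_html("User-agent: *"))
```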

Doing the same test on "https://www.facebook.com/robots.txt" again resulted in only one line, this time with a warning message:

# Notice: Crawling Facebook is prohibited unless you have express written

Testing http://menendez.senate.gov/contact/contact.cfm with the modified robotparser.read() function again resulted in an HTML header similar but not identical to that of http://www.ontheissues.org/robots.txt, with no errors. Here is the header line it printed for http://menendez.senate.gov/contact/contact.cfm:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Browsing http://menendez.senate.gov/contact/contact.cfm again, it initially results in http://www.menendez.senate.gov/404 which redirects after 10-15 seconds to http://www.menendez.senate.gov/. Such a redirect link can be coded as follows:

<meta http-equiv="refresh" content="15;url=http://www.menendez.senate.gov/" />
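A redirect declared this way happens client-side, so a non-browser client never sees it unless it looks for the tag. A rough sketch of detecting it (the regex and function name are illustrative only, and it only handles the common http-equiv-before-content attribute order):

```python
import re

META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*(\d+)\s*;\s*url=([^"\'>\s]+)',
    re.IGNORECASE)

def find_meta_refresh(html):
    # Returns (delay_seconds, target_url) if the page declares a
    # client-side redirect, else None.
    m = META_REFRESH.search(html)
    return (int(m.group(1)), m.group(2)) if m else None

html = '<meta http-equiv="refresh" content="15;url=http://www.menendez.senate.gov/" />'
print(find_meta_refresh(html))
```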

Searching the source of http://www.menendez.senate.gov/contact/ finds no match for "cfm" showing it contains no link to contact.cfm. Although such a link could be configured elsewhere in the web server or dynamically generated, it's not likely given that browsing it results in an HTTP 404 error at http://www.menendez.senate.gov/404.


1 Comment

Yeah, it does make sense. But you should have used the menendez.senate.gov/robots.txt file for robotparser to parse, as in my question's trial code.
