I am facing an issue with Python's robotparser module. It works fine for a particular URL but starts failing once I perform a specific sequence of steps. Below are the steps I performed and the outcome:
This sequence works fine:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
True
>>>
However, the following sequence fails even though it repeats the same steps as above:
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> url = "http://menendez.senate.gov/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://menendez.senate.gov/contact/contact.cfm")
False
>>>
>>> url = "http://www.ontheissues.org/robots.txt"
>>> rp.set_url(url)
>>> rp.read()
>>> rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm")
False
>>>
After debugging for some time, I found that it works fine if I create a new object every time I use a new URL; that is, I have to do "rp = robotparser.RobotFileParser()" every time the URL changes.
I am not sure this workaround is the right approach: since robotparser lets me change the URL via set_url(), it should presumably be able to handle such cases itself.
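If I read the source correctly, the problem can be reproduced offline: after certain error responses, read() sets a disallow_all flag on the parser, and neither set_url() nor parse() ever clears it, so the flag leaks into the next site. A sketch against Python 3's urllib.robotparser (the relocated robotparser module, which has the same logic); setting disallow_all by hand below just simulates what read() does after an error response:

```python
import urllib.robotparser  # Python 3 location of the old robotparser module

rp = urllib.robotparser.RobotFileParser()
rp.disallow_all = True  # simulate what read() does after an error response

# Re-point the same object at a permissive robots.txt, as in my second session.
rp.parse(["User-agent: *", "Allow: /"])
print(rp.can_fetch("*", "http://example.com/page.htm"))  # False -- stale flag wins

# A brand-new object with the exact same rules answers correctly.
fresh = urllib.robotparser.RobotFileParser()
fresh.parse(["User-agent: *", "Allow: /"])
print(fresh.can_fetch("*", "http://example.com/page.htm"))  # True
```

So the object is not stateless across reads, which matches the behaviour I saw: once one site poisons the flags, every later can_fetch() on the same object answers False.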
Also, in the case above, I get an HTTP 503 when I try to download "http://menendez.senate.gov/contact/contact.cfm" using requests.get() or any other way. I looked into the code of robotparser.py, and in the read() method of class RobotFileParser there is no specific handling for 5xx response codes. I am not sure why those response codes are not handled; I just wanted some pointers on what the reason for not handling them could be.
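In the meantime, one could bolt both fixes on from the outside. This is only a sketch of the idea, not stdlib behaviour: the hypothetical feed() helper takes a status/body pair (injected here for illustration instead of fetched over the network), resets the flags that plain read() leaves behind between URLs, and treats a 5xx as "server state unknown, disallow for now":

```python
import urllib.robotparser  # Python 3 location; the idea applies to the 2.x module too

class StrictRobotFileParser(urllib.robotparser.RobotFileParser):
    """Hypothetical subclass: reusable across URLs and conservative on 5xx."""

    def feed(self, status, body):
        # Clear the state a previous robots.txt left behind (plain read()
        # never resets these, which is what makes object reuse fail).
        self.disallow_all = False
        self.allow_all = False
        self.entries = []
        self.default_entry = None
        if status in (401, 403) or status >= 500:
            # 401/403 mirrors the stdlib; the 5xx case is the addition:
            # the server is misbehaving, so err on the side of fetching nothing.
            self.disallow_all = True
        elif status >= 400:
            self.allow_all = True
        elif status == 200:
            self.parse(body.splitlines())

rp = StrictRobotFileParser()
rp.feed(503, "")  # simulate the 503 from menendez.senate.gov
print(rp.can_fetch("*", "http://menendez.senate.gov/contact/contact.cfm"))  # False

# Reusing the same object for the next site now works.
rp.feed(200, "User-agent: *\nAllow: /")
print(rp.can_fetch("*", "http://www.ontheissues.org/House/Jim_Nussle.htm"))  # True
```

Whether disallowing on 5xx is the right policy is exactly my question, but at least with the reset the answer no longer depends on which site the object looked at previously.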