
I want to parse a robots.txt file in Python. I have explored robotParser and robotExclusionParser, but neither really satisfies my criteria. I want to fetch all the disallowed and allowed URLs in a single shot rather than manually checking each URL for whether it is allowed. Is there a library to do this?

  • Can I ask what robots.txt contains, and what you mean by parsing the text file? Commented Mar 29, 2017 at 6:19
  • robots.txt is a standard followed by every site that supports sitemaps. Sitemap: to make our content searchable. Commented Mar 29, 2017 at 6:19
  • Example: fortune.com/robots.txt robotstxt.org/robotstxt.html Commented Mar 29, 2017 at 6:20
  • Ah okay, that makes more sense now; maybe you should link to this in your question for others who are unfamiliar with this concept. Commented Mar 29, 2017 at 6:26
  • Since robots.txt data is in a <pre> tag, you cannot use an HTML parser here. An alternative: disallow = [i for i in data.split('\n') if 'Disallow' in i] Commented Mar 29, 2017 at 6:26

4 Answers


Why do you have to check your URLs manually? You can use urllib.robotparser in Python 3, and do something like this:

import urllib.robotparser as urobot
import urllib.request
import urllib.error
from bs4 import BeautifulSoup


url = "https://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    # base URL of the page actually served (after any redirects)
    actual_url = site.geturl()[:site.geturl().rfind('/')]

    my_list = soup.find_all("a", href=True)
    for i in my_list:
        # rather than checking for "#" here, you can filter the list before looping over it
        href = i["href"]
        if href != "#":
            newurl = actual_url + "/" + href
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want with each authorized webpage
            except urllib.error.URLError:
                pass
else:
    print("cannot scrape")
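If, as the question asks, you want the allowed and disallowed lists "in a single shot", the same parser can be applied to a list of candidate URLs. A minimal sketch, assuming a made-up robots.txt and placeholder URLs:

```python
import urllib.robotparser as urobot

# hypothetical robots.txt content, parsed directly instead of fetched
rp = urobot.RobotFileParser()
rp.parse("""User-agent: *
Allow: /public/
Disallow: /private/
""".splitlines())

# placeholder candidate URLs; in practice, collect these from the site
candidates = ["https://example.com/public/a", "https://example.com/private/b"]
allowed = [u for u in candidates if rp.can_fetch("*", u)]
disallowed = [u for u in candidates if not rp.can_fetch("*", u)]
print(allowed, disallowed)
```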



You can use the curl command to read the robots.txt file into a single string, split it on newlines, and check for allowed and disallowed URLs.

import os
result = os.popen("curl https://fortune.com/robots.txt").read()
result_data_set = {"Disallowed":[], "Allowed":[]}

for line in result.split("\n"):
    if line.startswith('Allow'):    # this is for allowed url
        result_data_set["Allowed"].append(line.split(': ')[1].split(' ')[0])    # to neglect the comments or other junk info
    elif line.startswith('Disallow'):    # this is for disallowed url
        result_data_set["Disallowed"].append(line.split(': ')[1].split(' ')[0])    # to neglect the comments or other junk info

print (result_data_set)
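The same split-and-collect idea works without shelling out to curl. A sketch using only the standard library, with the parsing separated out so it can be reused on any robots.txt text (parse_rules and fetch_rules are made-up helper names):

```python
import urllib.request

def parse_rules(robots_text):
    # collect Allow/Disallow paths, ignoring trailing comments
    rules = {"Allowed": [], "Disallowed": []}
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("allow:"):
            rules["Allowed"].append(line.split(":", 1)[1].strip())
        elif line.lower().startswith("disallow:"):
            rules["Disallowed"].append(line.split(":", 1)[1].strip())
    return rules

def fetch_rules(robots_url):
    # fetch robots.txt with urllib instead of os.popen("curl ...")
    with urllib.request.urlopen(robots_url) as resp:
        return parse_rules(resp.read().decode("utf-8", errors="replace"))
```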

1 Comment

You are welcome. Nope @Ritu, I couldn't find one that suffices your use case. Maybe you can extend this and build a library.

Actually, RobotFileParser can do the job; consider the following code:

from urllib.robotparser import RobotFileParser

def iterate_rules(robots_content):
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    # note: default_entry, entries and rulelines are undocumented
    # internals of urllib.robotparser
    entries = ([rfp.default_entry, *rfp.entries]
               if rfp.default_entry else rfp.entries)
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)

From my post on Medium.
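For reference, a runnable, self-contained version of this approach, with the import it needs and a made-up robots.txt sample. Since default_entry, entries and rulelines are undocumented internals of urllib.robotparser, this may break between Python versions:

```python
from urllib.robotparser import RobotFileParser

def iterate_rules(robots_content):
    # yield (useragents, path, allowance) for every rule in the file
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    entries = ([rfp.default_entry, *rfp.entries]
               if rfp.default_entry else rfp.entries)
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)

# hypothetical robots.txt content for illustration
sample = """User-agent: *
Allow: /public/
Disallow: /private/
"""
rules = list(iterate_rules(sample))
for agents, path, allowance in rules:
    print(agents, path, "Allow" if allowance else "Disallow")
```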



I'd like to share the smallest code.

import re

# response is assumed to be a requests.Response (or any object with a .text
# attribute) holding the robots.txt contents; the per-line anchor is needed,
# otherwise the pattern can match empty strings
sitemap_urls = re.findall(r'^sitemap:\s*(\S+)', response.text, re.IGNORECASE | re.MULTILINE)
print("sitemap_urls", sitemap_urls)

Pretty easy to extract, and re.IGNORECASE handles Sitemap whether it appears in capitals, lowercase, or mixed case.

Please test it and share your feedback.
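A self-contained check of this extraction on made-up sample text, using a per-line anchor so the group captures whole URLs:

```python
import re

# hypothetical robots.txt content with mixed-case Sitemap lines
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml
"""
sitemap_urls = re.findall(r'^sitemap:\s*(\S+)', sample, re.IGNORECASE | re.MULTILINE)
print("sitemap_urls", sitemap_urls)
```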
