
I want to parse a robots.txt file in Python. I have explored robotParser and robotExclusionParser, but neither really satisfies my criteria. I want to fetch all the disallowed and allowed URLs in a single shot rather than manually checking each URL for whether it is allowed. Is there a library to do this?

  • Can I ask what robots.txt contains, and what you mean by parsing the text file? Commented Mar 29, 2017 at 6:19
  • robots.txt is a standard followed by every site that supports sitemaps. Sitemap: to make our content searchable. Commented Mar 29, 2017 at 6:19
  • Example: fortune.com/robots.txt robotstxt.org/robotstxt.html Commented Mar 29, 2017 at 6:20
  • Ah okay, that makes more sense now; maybe you should link to this in your question for others who are unfamiliar with this concept. Commented Mar 29, 2017 at 6:26
  • Since robots.txt data is in a <pre> tag, you cannot use an HTML parser here. An alternative: disallow = [i for i in data.split('\n') if 'Disallow' in i] Commented Mar 29, 2017 at 6:26

4 Answers


Why do you have to check your URLs manually? You can use urllib.robotparser in Python 3, and do something like this:

import urllib.robotparser as urobot
import urllib.request
import urllib.error
from bs4 import BeautifulSoup


url = "https://example.com"
rp = urobot.RobotFileParser()
rp.set_url(url + "/robots.txt")
rp.read()
if rp.can_fetch("*", url):
    site = urllib.request.urlopen(url)
    sauce = site.read()
    soup = BeautifulSoup(sauce, "html.parser")
    # base URL of the page actually served (after any redirects)
    actual_url = site.geturl()[:site.geturl().rfind('/')]

    my_list = soup.find_all("a", href=True)
    for i in my_list:
        # rather than checking for "#" here, you can filter the list before looping over it
        href = i["href"]
        if href != "#":
            newurl = actual_url + "/" + href
            try:
                if rp.can_fetch("*", newurl):
                    site = urllib.request.urlopen(newurl)
                    # do what you want with each authorized webpage
            except urllib.error.URLError:
                pass
else:
    print("cannot scrape")
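If, as the question asks, you want the allowed and disallowed lists "in a single shot", the same parser can be applied to a list of candidate URLs. A minimal sketch, assuming a made-up robots.txt and placeholder URLs:

```python
import urllib.robotparser as urobot

# hypothetical robots.txt content, parsed directly instead of fetched
rp = urobot.RobotFileParser()
rp.parse("""User-agent: *
Allow: /public/
Disallow: /private/
""".splitlines())

# placeholder candidate URLs; in practice, collect these from the site
candidates = ["https://example.com/public/a", "https://example.com/private/b"]
allowed = [u for u in candidates if rp.can_fetch("*", u)]
disallowed = [u for u in candidates if not rp.can_fetch("*", u)]
print(allowed, disallowed)
```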



You can use the curl command to read the robots.txt file into a single string, split it on newlines, and check for allowed and disallowed URLs.

import os
result = os.popen("curl https://fortune.com/robots.txt").read()
result_data_set = {"Disallowed":[], "Allowed":[]}

for line in result.split("\n"):
    if line.startswith('Allow'):    # this is for allowed url
        result_data_set["Allowed"].append(line.split(': ')[1].split(' ')[0])    # to neglect the comments or other junk info
    elif line.startswith('Disallow'):    # this is for disallowed url
        result_data_set["Disallowed"].append(line.split(': ')[1].split(' ')[0])    # to neglect the comments or other junk info

print (result_data_set)
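The same split-and-collect idea works without shelling out to curl. A sketch using only the standard library, with the parsing separated out so it can be reused on any robots.txt text (parse_rules and fetch_rules are made-up helper names):

```python
import urllib.request

def parse_rules(robots_text):
    # collect Allow/Disallow paths, ignoring trailing comments
    rules = {"Allowed": [], "Disallowed": []}
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("allow:"):
            rules["Allowed"].append(line.split(":", 1)[1].strip())
        elif line.lower().startswith("disallow:"):
            rules["Disallowed"].append(line.split(":", 1)[1].strip())
    return rules

def fetch_rules(robots_url):
    # fetch robots.txt with urllib instead of os.popen("curl ...")
    with urllib.request.urlopen(robots_url) as resp:
        return parse_rules(resp.read().decode("utf-8", errors="replace"))
```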

1 Comment

You are welcome. Nope @Ritu, I couldn't find one that suffices your use case. Maybe you can extend this and build a library.

Actually, RobotFileParser can do the job; consider the following code:

from urllib.robotparser import RobotFileParser

def iterate_rules(robots_content):
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    # note: default_entry, entries and rulelines are undocumented
    # internals of urllib.robotparser
    entries = ([rfp.default_entry, *rfp.entries]
               if rfp.default_entry else rfp.entries)
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)

From my post on Medium.
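For reference, a runnable, self-contained version of this approach, with the import it needs and a made-up robots.txt sample. Since default_entry, entries and rulelines are undocumented internals of urllib.robotparser, this may break between Python versions:

```python
from urllib.robotparser import RobotFileParser

def iterate_rules(robots_content):
    # yield (useragents, path, allowance) for every rule in the file
    rfp = RobotFileParser()
    rfp.parse(robots_content.splitlines())
    entries = ([rfp.default_entry, *rfp.entries]
               if rfp.default_entry else rfp.entries)
    for entry in entries:
        for ruleline in entry.rulelines:
            yield (entry.useragents, ruleline.path, ruleline.allowance)

# hypothetical robots.txt content for illustration
sample = """User-agent: *
Allow: /public/
Disallow: /private/
"""
rules = list(iterate_rules(sample))
for agents, path, allowance in rules:
    print(agents, path, "Allow" if allowance else "Disallow")
```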



I'd like to share the smallest code.

import re

# response is assumed to be a requests.Response (or any object with a .text
# attribute) holding the robots.txt contents; the per-line anchor is needed,
# otherwise the pattern can match empty strings
sitemap_urls = re.findall(r'^sitemap:\s*(\S+)', response.text, re.IGNORECASE | re.MULTILINE)
print("sitemap_urls", sitemap_urls)

Pretty easy to extract, and re.IGNORECASE handles Sitemap whether it appears in capitals, lowercase, or mixed case.

Please test it and share your feedback.
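A self-contained check of this extraction on made-up sample text, using a per-line anchor so the group captures whole URLs:

```python
import re

# hypothetical robots.txt content with mixed-case Sitemap lines
sample = """User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/news-sitemap.xml
"""
sitemap_urls = re.findall(r'^sitemap:\s*(\S+)', sample, re.IGNORECASE | re.MULTILINE)
print("sitemap_urls", sitemap_urls)
```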
