
So I have a crawler that uses something like this:

#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text

This works pretty well, but if a file like an .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and on Linux it eventually just prints:

Killed

Is there a more efficient way to catch these non-HTML pages?

1 Answer


You can send a HEAD request and check the content type; only proceed if it's text/html:

import requests

# HEAD asks the server for the headers only, not the body
r = requests.head(url, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url, timeout=3).text
else:
    print("non-HTML page")

If you want to make just a single request:

import requests

r = requests.get(url, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non-HTML page")

5 Comments

Thanks. Does that HEAD request use a lot of bandwidth or time? It should double the time each request takes now, right? Is there any way to merge this into one web request for efficiency?
HEAD requests should be fast, as the server doesn't return the message body; it just returns the meta information.
Yes, it can be merged into one. requests.get(url) also returns the same headers, so you can check the content type there as well. Updated the answer.
That's awesome! Should I use the in operator instead of ==, since some sites return more info, like 'text/html; charset=utf-8'?
Yup, in should be used.
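A minimal sketch of the exact-match alternative (not from the thread): strip the parameters off the header before comparing, since 'text/html; charset=utf-8' won't equal 'text/html' directly.

import requests

r = requests.head(url, timeout=3)
# "text/html; charset=utf-8" -> "text/html"
media_type = r.headers.get("content-type", "").split(";")[0].strip().lower()
if media_type == "text/html":
    html = requests.get(url, timeout=3).text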
