
So I have a crawler that uses something like this:

#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text

This works pretty well, but if a file like an .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and on Linux it eventually just prints:

Killed

Is there a more efficient way to catch these non-HTML pages?

1 Answer


You can send a HEAD request and check the content type; only proceed if it's text/html:

import requests

# HEAD asks the server for the headers only, not the body
r = requests.head(url, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url, timeout=3).text
else:
    print("non-HTML page")

If you want to make just a single request:

import requests

r = requests.get(url, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non-HTML page")

5 Comments

Thanks. Does that HEAD request use a lot of bandwidth or time? It should double the time each request takes now, right? Is there any way to merge this into one web request for efficiency?
HEAD requests should be fast, as the server doesn't return the message body; it just returns the meta information.
Yes, it can be merged into one. requests.get(url) also returns the same headers, so you can check the content type there as well. Updated the answer.
That's awesome! Should I use the in operator instead of ==, since some sites return more info, like 'text/html; charset=utf-8'?
Yup, in should be used.
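A minimal sketch of the exact-match alternative (not from the thread): strip the parameters off the header before comparing, since 'text/html; charset=utf-8' won't equal 'text/html' directly.

import requests

r = requests.head(url, timeout=3)
# "text/html; charset=utf-8" -> "text/html"
media_type = r.headers.get("content-type", "").split(";")[0].strip().lower()
if media_type == "text/html":
    html = requests.get(url, timeout=3).text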
