I'm crawling several URLs to find specific keywords in their source code. However, about halfway through the list of websites, my spider suddenly stops because of an HTTP error such as 404 or 503.
My crawler:
import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if strdomain:
            req = urllib2.Request(strdomain)
            response = urllib2.urlopen(req)  # raises an error on 404/503 and stops the whole run
            html_content = response.read()

            # check the page source for each keyword
            for searchstring in keyword:
                if searchstring.lower() in str(html_content).lower():
                    print (strdomain, keyword, 'found')

f.close()  # redundant: the with block already closes the file
What code should I add so that the crawler ignores URLs that return HTTP errors and continues crawling the rest of the list?
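To illustrate what I mean, here is a rough sketch of the kind of error handling I have in mind, wrapping urlopen in a try/except for urllib2's HTTPError and URLError (untested, so I'm not sure this is the right approach):

import urllib2

keyword = ['viewport']

with open('listofURLs.csv') as f:
    for line in f:
        strdomain = line.strip()
        if not strdomain:
            continue
        try:
            response = urllib2.urlopen(urllib2.Request(strdomain))
        except urllib2.HTTPError as e:
            # server replied with an error status (404, 503, ...): skip this URL
            print (strdomain, 'skipped, HTTP error', e.code)
            continue
        except urllib2.URLError as e:
            # network-level failure (DNS lookup, refused connection, ...): skip as well
            print (strdomain, 'skipped,', e.reason)
            continue

        html_content = response.read()
        for searchstring in keyword:
            if searchstring.lower() in str(html_content).lower():
                print (strdomain, searchstring, 'found')

Is this the right way to do it, or is there a better way to keep the crawler going?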