0

I am trying to test an entire list of websites to see if the URLs are valid, and I want to know which ones are not.

import urllib2

filename=open(argfile,'r')
f=filename.readlines()
filename.close()

def urlcheck() :
    for line in f:
        try:
            urllib2.urlopen()
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)
urlcheck()
1
  • How does your code not work? Commented Feb 4, 2017 at 14:59

5 Answers

1

You have to pass the URL to urlopen:

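def urlcheck():
    for line in f:
        url = line.strip()  # drop the trailing newline read from the file
        try:
            urllib2.urlopen(url)
            print url, "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            # the server answered, but with an error status (e.g. 404, 403)
            print url, "SITE IS NOT FUNCTIONAL"
            print(e.code)
        except urllib2.URLError, e:
            # no usable response: DNS failure, refused connection, etc.
            print url, "SITE IS NOT FUNCTIONAL"
            print(e.args)
        except Exception, e:
            # e.g. ValueError for a malformed URL such as a missing scheme
            print url, "Invalid URL"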

Some edge cases and things to consider

A little bit on error codes and HTTPError, quoted from the urllib2 documentation:

Every HTTP response from the server contains a numeric “status code”. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a “redirection” that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can’t handle, urlopen will raise an HTTPError. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), and ‘401’ (authentication required).

Even when HTTPError is raised, you can inspect the error code (e.code) to see why:

  • A URL that is valid and available may still raise HTTPError with a code such as 401 or 403.
  • A valid URL may sometimes return a 5xx code because of a temporary server error.
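
As a rough illustration of that point, the error code can be mapped to a verdict before deciding that a site is "not functional". This is only a sketch: describe_status is a hypothetical helper, and the grouping of codes is an assumption about what counts as "valid" for your purposes.

```python
def describe_status(code):
    """Return a rough verdict for an HTTP status code (hypothetical helper)."""
    if code in (401, 403):
        # The URL exists, but the server wants credentials or refuses access.
        return "exists, but access is restricted"
    if code in (404, 410):
        return "not found"
    if 500 <= code < 600:
        # Often temporary; retrying later may succeed.
        return "server error, possibly temporary"
    if 400 <= code < 500:
        return "client error"
    return "ok"
```

Inside the HTTPError handler above, this could be used as print url, describe_status(e.code).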

0

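I would suggest using the requests library. Note that requests.get raises an exception (a subclass of requests.exceptions.RequestException) when the URL is malformed or the host cannot be reached, so wrap the call:

import requests

try:
    resp = requests.get('your url', timeout=10)
    if not resp.ok:
        print resp.status_code
except requests.exceptions.RequestException as e:
    print e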

0

You have to pass url as a parameter to the urlopen function.

import urllib2

filename=open(argfile,'r')
f=filename.readlines()
filename.close()

def urlcheck() :
    for line in f:
        try:
            urllib2.urlopen(line.strip())  # pass the URL, minus the trailing newline from readlines()
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)
urlcheck()

0
import urllib2

def check(url):
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'  # fetch only the headers, not the body (faster)
    # Accept-Encoding (not Content-Encoding) is the request header that advertises compression support
    request.add_header('Accept-Encoding', 'gzip, deflate, br')
    try:
        response = urllib2.urlopen(request)
        return response.getcode() < 400  # 400 and above are error statuses
    except Exception:
        return False

'''
Contents of "/tmp/urls.txt"

http://www.google.com
https://fb.com
http://not-valid
http://not-valid.nvd
not-valid
'''
with open('/tmp/urls.txt', 'r') as f:
    # strip the trailing newline so it is neither sent in the request nor printed
    urls = [line.strip() for line in f]

for url in urls:
    print url + ' ' + str(check(url))
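
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same HEAD-request idea might look roughly like the sketch below. The timeout parameter and the exact set of caught exceptions are assumptions added here, not part of the code above.

```python
import http.client
import urllib.error
import urllib.request


def check(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds, else False."""
    try:
        # In Python 3, Request() itself raises ValueError for a URL with
        # no scheme, so it has to sit inside the try block as well.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (ValueError, OSError, http.client.HTTPException):
        # ValueError: malformed URL.
        # OSError: covers URLError and HTTPError (error status, DNS
        #   failure, refused connection) and socket timeouts.
        # HTTPException: e.g. InvalidURL for control characters in the URL.
        return False
```

Some servers reject HEAD requests even for pages that load fine with GET, so a False result here is a hint worth rechecking, not proof that the URL is dead.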

0

I would probably write it like this:

import urllib2

with open('urls.txt') as f:
    urls = [url.strip() for url in f.readlines()]

def urlcheck() :
    for url in urls:
        try:
            urllib2.urlopen(url)
        except (ValueError, urllib2.URLError) as e:
            print('invalid url: {}'.format(url))

urlcheck()

Some changes from the OP's original implementation:

  • use a context manager to open/close data file
  • strip newlines from URLs as they are read from file
  • use better variable names
  • switch to more modern exception handling style
  • also catch ValueError for malformed URLs
  • display a more useful error message

Example output:

$ python urlcheck.py 
invalid url: http://www.google.com/wertbh
invalid url: htp:/google.com
invalid url: google.com
invalid url: https://wwwbad-domain-zzzz.com
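
The output above lumps together two different failures: URLs that are malformed (htp:/google.com, google.com) and URLs that are well formed but do not resolve or return an error. If that distinction matters, a syntax check can run before any network request. The following is a minimal Python 3 sketch; is_well_formed is a hypothetical helper, and accepting only http and https is an assumption.

```python
from urllib.parse import urlsplit


def is_well_formed(url):
    """Return True if `url` has an http(s) scheme and a host part.

    This checks syntax only; it says nothing about whether the host
    exists or the page can be fetched.
    """
    parts = urlsplit(url.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in ["http://www.google.com/wertbh", "htp:/google.com", "google.com"]:
        label = "well formed" if is_well_formed(candidate) else "malformed"
        print(candidate, "->", label)
```

URLs that pass this check still need the urlopen call to find out whether they are reachable.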
