Checking URLs with Python

Question

I am trying to test an entire list of websites to see if the URLs are valid, and I want to know which ones are not.

import urllib2

filename=open(argfile,'r')
f=filename.readlines()
filename.close()

def urlcheck() :
    for line in f:
        try:
            urllib2.urlopen()
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)
urlcheck()

How does your code not work?

– Jongware, Commented Feb 4, 2017 at 14:59

Sarath Sadasivan Pillai · Accepted Answer · 2017-02-04 15:11:00Z

You have to pass the URL to urlopen:

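def urlcheck():
    for line in f:
        url = line.strip()  # drop the trailing newline left by readlines()
        try:
            urllib2.urlopen(url)
            print url, "SITE IS FUNCTIONAL"
        except urllib2.HTTPError as e:
            # the server answered, but with an error status (4xx/5xx)
            print url, "SITE IS NOT FUNCTIONAL"
            print(e.code)
        except urllib2.URLError as e:
            # no usable response (DNS failure, refused connection, ...)
            print url, "SITE IS NOT FUNCTIONAL"
            print(e.args)
        except Exception as e:
            # e.g. ValueError for a string that is not a URL at all
            print url, "Invalid URL"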

Some edge cases or things to consider

A little bit on error codes and HTTPError

Every HTTP response from the server contains a numeric “status code”. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a “redirection” that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can’t handle, urlopen will raise an HTTPError. Typical errors include ‘404’ (page not found), ‘403’ (request forbidden), and ‘401’ (authentication required).

Even when HTTPError is raised, you can inspect the error code (e.code) to see why the request failed.

A URL that is valid and available may still raise HTTPError with a code such as 401 or 403.
A valid URL can also return a 5xx code because of a temporary server error.
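
As a rough illustration of the point above, the status code can be mapped to a verdict instead of treating every HTTPError as a dead link. This is only a sketch: the function name classify_status and the verdict strings are made up for this example, and the grouping of codes is one possible choice, not a standard.

```python
def classify_status(code):
    """Map an HTTP status code to a rough verdict about the URL.

    The verdict strings are illustrative, not part of any library.
    """
    if 200 <= code < 400:
        return "ok"
    if code in (401, 403):
        # The resource exists, but the server wants credentials or refuses access.
        return "exists, access restricted"
    if code in (404, 410):
        return "missing"
    if 500 <= code < 600:
        # Often temporary; worth retrying before declaring the URL dead.
        return "server error, retry later"
    return "other error"
```

Inside an except clause for HTTPError, calling classify_status(e.code) would then separate a genuinely missing page from one that is merely protected or temporarily failing.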

Kroustou · Accepted Answer · 2017-02-04 14:55:32Z

I would suggest using the requests library.

import requests
resp = requests.get('your url')
if not resp.ok:
    print resp.status_code

Hasan Alper Ocalan · Accepted Answer · 2017-02-04 14:57:46Z

You have to pass the URL as an argument to the urlopen function.

import urllib2

filename=open(argfile,'r')
f=filename.readlines()
filename.close()

def urlcheck() :
    for line in f:
        try:
            urllib2.urlopen(line)  # the URL argument was missing in the original call
            print "SITE IS FUNCTIONAL"
        except urllib2.HTTPError, e:
            print(e.code)
        except urllib2.URLError, e:
            print(e.args)
urlcheck()

cetver · Accepted Answer · 2017-02-04 15:14:26Z

import urllib2

def check(url):
    request = urllib2.Request(url)
    request.get_method = lambda: 'HEAD'  # fetch only the headers, not the body (faster)
    request.add_header('Accept-Encoding', 'gzip, deflate, br')  # tell the server compressed responses are acceptable
    try:
        response = urllib2.urlopen(request)
        return response.getcode() < 400  # 2xx and 3xx count as reachable
    except Exception:
        return False

'''
Contents of "/tmp/urls.txt"

http://www.google.com
https://fb.com
http://not-valid
http://not-valid.nvd
not-valid
'''
filename = open('/tmp/urls.txt', 'r')
urls = [line.strip() for line in filename]  # strip the trailing newline from each URL
filename.close()

for url in urls:
    print url + ' ' + str(check(url))
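
For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, the same HEAD-request idea might look like the sketch below. The names head_request and is_reachable, the opener parameter, and the 10-second timeout are assumptions made for this example; the opener parameter exists only so the function can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def head_request(url):
    """Build a HEAD request so that only the headers are transferred."""
    return urllib.request.Request(url, method="HEAD")


def is_reachable(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request for url yields a status below 400."""
    try:
        with opener(head_request(url), timeout=timeout) as response:
            return response.getcode() < 400
    except (ValueError, OSError):
        # ValueError: the string is not a URL (for example, no scheme).
        # OSError: covers URLError, HTTPError and socket timeouts.
        return False
```

Some servers reject HEAD or answer it differently from GET, so a False result here may be worth re-checking with a normal GET request.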

Corey Goldberg · Accepted Answer · 2017-02-04 15:17:12Z

I would probably write it like this:

import urllib2

with open('urls.txt') as f:
    urls = [url.strip() for url in f.readlines()]

def urlcheck() :
    for url in urls:
        try:
            urllib2.urlopen(url)
        except (ValueError, urllib2.URLError) as e:
            print('invalid url: {}'.format(url))

urlcheck()

Some changes from the OP's original implementation:

- use a context manager to open/close the data file
- strip newlines from URLs as they are read from the file
- use better variable names
- switch to a more modern exception handling style
- also catch ValueError for malformed URLs
- display a more useful error message

example output:

$ python urlcheck.py 
invalid url: http://www.google.com/wertbh
invalid url: htp:/google.com
invalid url: google.com
invalid url: https://wwwbad-domain-zzzz.com
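
The output above lumps every failure together as "invalid url". A possible Python 3 variant that reports why each URL failed is sketched below. It assumes Python 3's urllib.request and urllib.error (the successors of urllib2); check_url, check_file, the opener parameter and the message strings are names invented for this example.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return (ok, detail) describing what happened when opening url."""
    try:
        with opener(url, timeout=timeout) as response:
            return True, "HTTP {}".format(response.getcode())
    except urllib.error.HTTPError as e:
        # The server answered, but with a 4xx/5xx status.
        return False, "HTTP {}".format(e.code)
    except urllib.error.URLError as e:
        # No usable response: DNS failure, refused connection, unknown scheme.
        return False, "unreachable: {}".format(e.reason)
    except ValueError as e:
        # The string is not a URL at all, for example it has no scheme.
        return False, "malformed: {}".format(e)


def check_file(path):
    """Check every non-empty line of a text file containing one URL per line."""
    with open(path) as f:
        urls = [line.strip() for line in f if line.strip()]
    return [(url,) + check_url(url) for url in urls]
```

HTTPError has to be caught before URLError because it is a subclass of it; with the order reversed, the status code branch would never run.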
