
I've been trying to figure out the best way to validate a URL (specifically in Python) but haven't really been able to find an answer. It seems like there isn't one known way to validate a URL, and that it depends on which URLs you think you may need to validate. I also found it difficult to find an easy-to-read standard for URL structure. I did find RFC 3986 and RFC 3987, but they contain much more than just how a URL is structured.

Am I missing something, or is there no one standard way to validate a URL?


6 Answers


This looks like it might be a duplicate of "How do you validate a URL with a regular expression in Python?"

You should be able to use the urlparse function described there.

>>> from urllib.parse import urlparse # python2: from urlparse import urlparse
>>> urlparse('actually not a url')
ParseResult(scheme='', netloc='', path='actually not a url', params='', query='', fragment='')
>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')

Call urlparse on the string you want to check, then make sure the resulting ParseResult has non-empty scheme and netloc values.
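
A minimal sketch of that check (the helper name is mine, not from the linked answer):

from urllib.parse import urlparse

def is_probably_url(candidate):
    # "Valid" here just means urlparse found both a scheme and a network location;
    # it says nothing about whether the host exists or is reachable.
    parsed = urlparse(candidate)
    return bool(parsed.scheme) and bool(parsed.netloc)

is_probably_url('http://google.com')   # True
is_probably_url('actually not a url')  # False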


7 Comments

You might want to use rfc3987 (pypi.python.org/pypi/rfc3987; a usage sketch follows these comments) or do more processing on the urlparse result. urlparse won't actually validate a netloc as an "internet URL" -- I got bitten by this too. urlparse('http://invalidurl') will still give you a scheme and netloc.
It does look like that's a stricter parser, but rfc3987 lets through both of those cases as well (999.999.999.999.999.999 and http://examplecom).
In Python 3: import urllib.parse as urlparse
@gies0r This should probably be from urllib.parse import urlparse, since import urllib.parse as urlparse imports the whole parse module rather than just the function.
So "x://a.bc.1" is a valid URL (scheme='x', netloc='a.bc.1') and apple.de not (scheme='', netloc='') !? Not really practical…

The original question is a bit old, but you might also want to look at the Validator-Collection library I released a few months back. It includes high-performing, regex-based validation of URLs for compliance with the RFC standard. Some details:

  • Tested against Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8
  • No dependencies on Python 3.x, one conditional dependency in Python 2.x (drop-in replacement for Python 2.x's buggy re module)
  • Unit tests that cover 100+ different succeeding/failing URL patterns, including non-standard characters and the like. As close to covering the whole spectrum of the RFC standard as I've been able to find.

It's also very easy to use:

from validator_collection import validators, checkers

checkers.is_url('http://www.stackoverflow.com')
# Returns True

checkers.is_url('not a valid url')
# Returns False

value = validators.url('http://www.stackoverflow.com')
# value set to 'http://www.stackoverflow.com'

value = validators.url('not a valid url')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('https://123.12.34.56:1234')
# value set to 'https://123.12.34.56:1234'

value = validators.url('http://10.0.0.1')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('http://10.0.0.1', allow_special_ips = True)
# value set to 'http://10.0.0.1'

In addition, Validator-Collection includes 60+ other validators, covering IP addresses (IPv4 and IPv6), domains, and email addresses as well, so it's something folks might find useful.
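
A rough idea of what those other checkers look like in use (a sketch on my part; the function names are what I recall from the library's docs, so verify them there):

from validator_collection import checkers

checkers.is_email('someone@example.com')    # True
checkers.is_domain('stackoverflow.com')     # True
checkers.is_ip_address('123.12.34.56')      # True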

4 Comments

This looks like a really nice package. I haven't tried it yet, but it deserves more than 0 upvotes :-).
This only works with domain names; it doesn't appear to like IP addresses, though. proxy.remote.http: 'XX.XXX.X.XXX:XXXX' is not a url. proxy.remote.https: 'XX.XXX.X.XXX:XXXX' is not a url.
Not sure I understand exactly what you mean. The value XX.XXX.X.XXX:XXXX will never validate correctly because a) it does not have a valid protocol, and b) the port (:XXXX) is not expressed as a valid port address. If you try to validate http://XX.XXX.X.XXX:1234 that will validate correctly. If you try to validate an IP http://123.165.43.12:1234 that will validate as well. What's the exact issue that you're encountering?
Also, a follow-up: there are certain special IP addresses (such as 127.0.0.1 or 0.0.0.0) which are considered special cases by the RFCs for URLs and IP addresses. By default, they will fail validation. However, you can have them be allowed (pass validation) by passing the allow_special_ips = True parameter to the validator function. More details in the documentation.

I would use the validators package. Here is the link to the documentation and installation instructions.

It is just as simple as

import validators
url = 'YOUR URL'
validators.url(url)

It returns a truthy result if the URL is valid and a falsy result if not.
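
If I remember the package's behaviour correctly, an invalid value comes back as a falsy failure object rather than a literal False, so wrapping the call in bool() gives a clean True/False check (sketch, with my own helper name):

import validators

def is_valid(url):
    # validators.url returns True for a valid URL and a falsy failure object otherwise
    return bool(validators.url(url))

is_valid('http://google.com')  # True
is_valid('apple.com')          # False (no scheme)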

3 Comments

The following fails: print(validators.url("apple.com"))
@Larytet Because that's not a valid url.
However, I found a case in which validators fails: https:// seekingalpha dot com/article/4353927/track?type=cli....traºnner_utm_.... (eliminating the extra stuff with "..."). The "º" is not detected and validators returns True, but this URL is in fact not valid.
import re

def is_link(url):
    # Requires an explicit http/https/ftp scheme and at least one dot in the host
    url_regex = r'\b((http|https|ftp):\/\/[a-z0-9-]+(\.[a-z0-9-]+)+([\/?].*)?)\b'
    return bool(re.match(url_regex, url, re.IGNORECASE))
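
A few example calls (mine, not from the answer) to show what this regex accepts; note that it requires an explicit scheme and at least one dot in the host:

is_link('https://example.com/path?q=1')  # True
is_link('ftp://files.example.com')       # True
is_link('example.com')                   # False (no scheme)
is_link('http://localhost')              # False (no dot in the host)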

2 Comments

Thank you for your interest in contributing to the Stack Overflow community. This question already has a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn’t been given previously? If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient. Can you kindly edit your answer to offer an explanation?
It's bad practice to reinvent the wheel when tried and tested tools are available. There's a good chance your regex won't catch edge cases you haven't considered.

You can also try using urllib.request to validate a URL by passing it to the urlopen function and catching URLError.

from urllib.request import urlopen
from urllib.error import URLError

def validate_web_url(url="http://google"):
    try:
        urlopen(url)
        return True
    except URLError:
        return False

This would return False in this case

1 Comment

Would this work when your working machine has no internet connection?

Assuming you are using Python 3, you could use urllib. The code would go something like this:

import urllib.request as req
from urllib.error import URLError

def foo():
    url = 'http://bar.com'
    request = req.Request(url)
    try:
        response = req.urlopen(request)
        html = response.read()  # the page's HTML as bytes; decode it if you need a string
        return True
    except URLError:
        # The URL wasn't valid (or the host couldn't be reached)
        return False

If the urlopen call raises no error, the URL is valid (and reachable).

2 Comments

This only works if the host has an internet connection, which may not always be true.
It would be preferable not to have to use an internet connection to determine whether the URL is valid. Also, I'm using Python 2.7; I should have specified that in the original question.
