
I've been trying to figure out the best way to validate a URL (specifically in Python) but haven't really been able to find an answer. It seems like there isn't one known way to validate a URL, and that it depends on which URLs you think you may need to validate. I also found it difficult to find an easy-to-read standard for URL structure. I did find RFC 3986 and RFC 3987, but they contain much more than just how a URL is structured.

Am I missing something, or is there no one standard way to validate a URL?


6 Answers


This looks like it might be a duplicate of "How do you validate a URL with a regular expression in Python?"

You should be able to use the urlparse function described there.

>>> from urllib.parse import urlparse # python2: from urlparse import urlparse
>>> urlparse('actually not a url')
ParseResult(scheme='', netloc='', path='actually not a url', params='', query='', fragment='')
>>> urlparse('http://google.com')
ParseResult(scheme='http', netloc='google.com', path='', params='', query='', fragment='')

Call urlparse on the string you want to check, then make sure the resulting ParseResult has non-empty scheme and netloc values.
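
A minimal sketch of that check (the helper name is mine, not from the linked answer):

from urllib.parse import urlparse

def is_probably_url(candidate):
    # "Valid" here just means urlparse found both a scheme and a network location;
    # it says nothing about whether the host exists or is reachable.
    parsed = urlparse(candidate)
    return bool(parsed.scheme) and bool(parsed.netloc)

is_probably_url('http://google.com')   # True
is_probably_url('actually not a url')  # False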


7 Comments

You might want to use rfc3987 (pypi.python.org/pypi/rfc3987; a usage sketch follows these comments) or do more processing on the urlparse result. urlparse won't actually validate a netloc as an "internet URL" -- I got bitten by this too. urlparse('http://invalidurl') will still give you a scheme and netloc.
It does look like that's a stricter parser, but rfc3987 lets through both of those cases as well (999.999.999.999.999.999 and http://examplecom).
In Python 3: import urllib.parse as urlparse
@gies0r This should probably be from urllib.parse import urlparse, since import urllib.parse as urlparse imports the whole parse module rather than just the function.
So "x://a.bc.1" is a valid URL (scheme='x', netloc='a.bc.1') and apple.de not (scheme='', netloc='') !? Not really practical…

The original question is a bit old, but you might also want to look at the Validator-Collection library I released a few months back. It includes high-performing, regex-based validation of URLs for compliance with the RFC standard. Some details:

  • Tested against Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8
  • No dependencies on Python 3.x, one conditional dependency in Python 2.x (drop-in replacement for Python 2.x's buggy re module)
  • Unit tests that cover 100+ different succeeding/failing URL patterns, including non-standard characters and the like. As close to covering the whole spectrum of the RFC standard as I've been able to find.

It's also very easy to use:

from validator_collection import validators, checkers

checkers.is_url('http://www.stackoverflow.com')
# Returns True

checkers.is_url('not a valid url')
# Returns False

value = validators.url('http://www.stackoverflow.com')
# value set to 'http://www.stackoverflow.com'

value = validators.url('not a valid url')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('https://123.12.34.56:1234')
# value set to 'https://123.12.34.56:1234'

value = validators.url('http://10.0.0.1')
# raises a validator_collection.errors.InvalidURLError (which is a ValueError)

value = validators.url('http://10.0.0.1', allow_special_ips = True)
# value set to 'http://10.0.0.1'

In addition, Validator-Collection includes 60+ other validators, covering IP addresses (IPv4 and IPv6), domains, and email addresses as well, so it's something folks might find useful.
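
A rough idea of what those other checkers look like in use (a sketch on my part; the function names are what I recall from the library's docs, so verify them there):

from validator_collection import checkers

checkers.is_email('someone@example.com')    # True
checkers.is_domain('stackoverflow.com')     # True
checkers.is_ip_address('123.12.34.56')      # True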

4 Comments

This looks like a really nice package. I haven't tried it yet, but it deserves more than 0 upvotes :-).
This only works with domain names; it doesn't appear to like IP addresses, though. proxy.remote.http: 'XX.XXX.X.XXX:XXXX' is not a url. proxy.remote.https: 'XX.XXX.X.XXX:XXXX' is not a url.
Not sure I understand exactly what you mean. The value XX.XXX.X.XXX:XXXX will never validate correctly because a) it does not have a valid protocol, and b) the port (:XXXX) is not expressed as a valid port address. If you try to validate http://XX.XXX.X.XXX:1234 that will validate correctly. If you try to validate an IP http://123.165.43.12:1234 that will validate as well. What's the exact issue that you're encountering?
Also, a follow-up: there are certain special IP addresses (such as 127.0.0.1 or 0.0.0.0) which are considered special cases by the RFCs for URLs and IP addresses. By default, they will fail validation. However, you can have them be allowed (pass validation) by passing the allow_special_ips = True parameter to the validator function. More details in the documentation.

I would use the validators package. Here is the link to the documentation and installation instructions.

It is just as simple as

import validators
url = 'YOUR URL'
validators.url(url)

It returns a truthy result if the URL is valid and a falsy result if not.
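
If I remember the package's behaviour correctly, an invalid value comes back as a falsy failure object rather than a literal False, so wrapping the call in bool() gives a clean True/False check (sketch, with my own helper name):

import validators

def is_valid(url):
    # validators.url returns True for a valid URL and a falsy failure object otherwise
    return bool(validators.url(url))

is_valid('http://google.com')  # True
is_valid('apple.com')          # False (no scheme)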

3 Comments

The following fails: print(validators.url("apple.com"))
@Larytet Because that's not a valid url.
However, I found a case in which validators fails: https:// seekingalpha dot com/article/4353927/track?type=cli....traºnner_utm_.... (eliminating the extra stuff with "..."). The "º" is not detected and validators returns True, but this URL is in fact not valid.
import re

def is_link(url):
    # Requires an explicit http/https/ftp scheme and at least one dot in the host
    url_regex = r'\b((http|https|ftp):\/\/[a-z0-9-]+(\.[a-z0-9-]+)+([\/?].*)?)\b'
    return bool(re.match(url_regex, url, re.IGNORECASE))
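
A few example calls (mine, not from the answer) to show what this regex accepts; note that it requires an explicit scheme and at least one dot in the host:

is_link('https://example.com/path?q=1')  # True
is_link('ftp://files.example.com')       # True
is_link('example.com')                   # False (no scheme)
is_link('http://localhost')              # False (no dot in the host)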

2 Comments

Thank you for your interest in contributing to the Stack Overflow community. This question already has a few answers—including one that has been extensively validated by the community. Are you certain your approach hasn’t been given previously? If so, it would be useful to explain how your approach is different, under what circumstances your approach might be preferred, and/or why you think the previous answers aren’t sufficient. Can you kindly edit your answer to offer an explanation?
It's bad practice to reinvent the wheel when tried and tested tools are available. There's a good chance your regex won't catch edge cases you haven't considered.

You can also try using urllib.request to validate a URL by passing it to the urlopen function and catching URLError.

from urllib.request import urlopen
from urllib.error import URLError

def validate_web_url(url="http://google"):
    try:
        urlopen(url)
        return True
    except URLError:
        return False

This would return False in this case

1 Comment

Would this work when your working machine has no internet connection?

Assuming you are using Python 3, you could use urllib. The code would go something like this:

import urllib.request as req
from urllib.error import URLError

def foo():
    url = 'http://bar.com'
    request = req.Request(url)
    try:
        response = req.urlopen(request)
        html = response.read()  # the page's HTML as bytes; decode it if you need a string
        return True
    except URLError:
        # The URL wasn't valid (or the host couldn't be reached)
        return False

If the urlopen call raises no error, the URL is valid (and reachable).

2 Comments

This only works if the host has an internet connection, which may not always be true.
It would be preferable not to have to use an internet connection to determine whether the URL is valid. Also, I'm using Python 2.7; I should have specified that in the original question.
