1

I have created a "lite" URL regex, meaning it is not expected to detect every URL. The aim is only to cover simple URLs.

#! python3
# urls.py - Detecting urls that begin with http:// or https://

import re

urlRegex = re.compile(r'''(
    (http://|https://)+          # the http(s) part of the url
    (w{3}\.)?              # the world-wide-web part
    ([a-z0-9-])+            # the domain name
    (\.[a-z]{2,4})?        # sub level domain
    (\.[a-z]{2,4})        # top level domain
    (/[-A-Za-z0-9+&@#/%=~_|])* # extension i.e paths
)''', re.VERBOSE)
test = urlRegex.search('https://www.facebook.com/user_2033')

The output of test.groups() was:

('https://www.facebook.com/user_2033', 'https://', 'www.', 'k', None, '.com', '/u')

After numerous attempts, I'm still unable to capture the complete website name and extension, i.e. 'facebook' rather than 'k' and '/user_2033' rather than '/u'. Any help that doesn't completely change my own code would be much appreciated.

2 Answers

2

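(PATTERN)* or (PATTERN)+ repeats the capturing group itself, and a repeated group keeps only the text of its last repetition, which is why only the last matched character is captured. Move the quantifier inside the parentheses, (PATTERN*) or (PATTERN+), to capture all of the characters.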
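A minimal demonstration of the difference, using the domain-name character class on its own (the variable names are only illustrative):

```python
import re

# The group is overwritten on every repetition, so only the final
# repetition (one character here) is left in it.
last_only = re.search(r'([a-z0-9-])+', 'facebook')
print(last_only.group(1))   # k

# With the quantifier inside the parentheses, the group spans the whole run.
whole_run = re.search(r'([a-z0-9-]+)', 'facebook')
print(whole_run.group(1))   # facebook
```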


In the question's pattern, this line:

([a-z0-9-])+          # the domain name

should be replaced with:

([a-z0-9-]+)          # the domain name

The same applies to the last part:

(/[-A-Za-z0-9+&@#/%=~_|])* # extension i.e paths

should be replaced with:

(/[-A-Za-z0-9+&@#/%=~_|]*) # extension i.e paths

Output (for the pattern as shown in the question, which has seven capturing groups):

('https://www.facebook.com/user_2033', 'https://', 'www.',
 'facebook', None, '.com', '/user_2033')
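For convenience, here is a runnable sketch of the question's pattern with only those two quantifiers moved inside their groups. The names url_regex and match are my own, and everything else is left as posted, so the tuple it prints has one entry per capturing group in that pattern.

```python
import re

# The question's pattern; only the two marked lines differ from the original.
url_regex = re.compile(r'''(
    (http://|https://)+             # the http(s) part of the url
    (w{3}\.)?                       # the world-wide-web part
    ([a-z0-9-]+)                    # the domain name: quantifier moved inside
    (\.[a-z]{2,4})?                 # sub level domain
    (\.[a-z]{2,4})                  # top level domain
    (/[-A-Za-z0-9+&@#/%=~_|]*)      # path: quantifier moved inside
)''', re.VERBOSE)

match = url_regex.search('https://www.facebook.com/user_2033')
print(match.groups())
```

Note that the path group is no longer optional in this form, so a URL with no path at all (for example 'https://www.facebook.com') will not match unless a ? is added after that group.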

BTW, you can use urllib.parse.urlparse (Python 3) / urlparse.urlparse (Python 2) instead of a regular expression:

>>> import urllib.parse
>>> urllib.parse.urlparse('https://www.facebook.com/user_2033')
ParseResult(scheme='https', netloc='www.facebook.com',
            path='/user_2033', params='', query='', fragment='')
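If the goal is the bare site name ('facebook'), the parsed host can be split on dots. This is only a sketch: site_name is a hypothetical helper, and it assumes a host shaped like [www.]name.tld, so it will pick the wrong label for hosts such as 'example.co.uk'.

```python
from urllib.parse import urlparse

def site_name(url):
    """Return the label before the final dot of the host.

    Hypothetical helper; assumes a host shaped like [www.]name.tld.
    """
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    # Fall back to the whole host when there is no dot (e.g. 'localhost').
    return labels[-2] if len(labels) >= 2 else host

parts = urlparse('https://www.facebook.com/user_2033')
print(parts.scheme)    # https
print(parts.hostname)  # www.facebook.com
print(parts.path)      # /user_2033
print(site_name('https://www.facebook.com/user_2033'))  # facebook
```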

2 Comments

@ITJ, I added explanation at the top of the answer.
@ITJ, I just added an alternative way to parse the URL (using urllib.parse.urlparse).
0

I've used the following regex for simple URL validation:

((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*
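A possible way to use it from Python, as a sketch: the expression is copied unchanged, looks_like_url is a hypothetical helper name, and using re.fullmatch (so the whole string has to match) is my assumption about what "verify" should mean here.

```python
import re

# The expression from this answer, unchanged (split over two lines only).
URL_PATTERN = re.compile(
    r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}'
    r'([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'
)

def looks_like_url(text):
    """Hypothetical helper: True if the entire string matches the pattern."""
    return URL_PATTERN.fullmatch(text) is not None

print(looks_like_url('https://www.facebook.com/user_2033'))  # True
print(looks_like_url('not a url'))                           # False
```

Because the scheme is optional in this pattern, a bare 'www.facebook.com' is accepted as well.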
