Python: Incomplete URL Regex Output

Question

I have created a "lite" URL regex, meaning it may not detect every URL. My aim was only to cover simple URLs.

#! python3
# urls.py - Detecting urls that begin with http:// or https://

import re

urlRegex = re.compile(r'''(
    (http://|https://)+          # the http(s) part of the url
    (w{3}\.)?              # the world-wide-web part
    ([a-z0-9-])+            # the domain name
    (\.[a-z]{2,4})?        # sub level domain
    (\.[a-z]{2,4})        # top level domain
    (/[-A-Za-z0-9+&@#/%=~_|])* # extension i.e paths
)''', re.VERBOSE)
test = urlRegex.search('https://www.facebook.com/user_2033')

The output of test.groups() was:

('https://www.facebook.com/user_2033', 'https://', 'www.', 'k', None, '.com', '/u')
[Finished in 0.058s]

After numerous attempts, I'm still unable to capture the complete website name and extension, i.e. 'facebook' rather than 'k'. Any help that doesn't require completely changing my code would be much appreciated.

falsetru · Accepted Answer · 2017-07-02 02:06:35Z

2

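(PATTERN)* or (PATTERN)+ repeats the group itself, and a repeated group keeps only its last repetition, so only the last matched character is captured. Put the quantifier inside the parentheses, (PATTERN*) or (PATTERN+), to capture all characters.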
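
To see the difference in isolation, here is a minimal sketch. The pattern and the input string are made up for illustration and are not taken from the question:

```python
import re

# Hypothetical input chosen only to illustrate the behaviour.
text = 'facebook'

# Quantifier outside the group: the group is re-captured on every
# repetition, so only the last repetition survives.
last_only = re.match(r'([a-z])+', text).group(1)

# Quantifier inside the group: the group spans the whole run.
whole_run = re.match(r'([a-z]+)', text).group(1)

print(last_only)   # k
print(whole_run)   # facebook
```

Both patterns match the same text overall; only what ends up in group 1 differs.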

In your pattern, this line:

([a-z0-9-])+          # the domain name

should be replaced with:

([a-z0-9-]+)          # the domain name

The same applies to the last part:

(/[-A-Za-z0-9+&@#/%=~_|])* # extension i.e paths

which becomes:

(/[-A-Za-z0-9+&@#/%=~_|]*) # extension i.e paths

Output:

('https://www.facebook.com/user_2033', 'https://', 'www.',
 'facebook', None, '.com', '/user_2033')
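
For reference, a runnable sketch of the question's script with just those two quantifiers moved inside their groups. The variable names url_regex, match and groups are my own; the pattern is otherwise as posted:

```python
import re

# The question's pattern with only the two quantifiers moved inside
# their groups; everything else is unchanged.
url_regex = re.compile(r'''(
    (http://|https://)+         # the http(s) part of the url
    (w{3}\.)?                   # the world-wide-web part
    ([a-z0-9-]+)                # the domain name
    (\.[a-z]{2,4})?             # sub level domain
    (\.[a-z]{2,4})              # top level domain
    (/[-A-Za-z0-9+&@#/%=~_|]*)  # extension i.e paths
)''', re.VERBOSE)

match = url_regex.search('https://www.facebook.com/user_2033')
groups = match.groups()
print(groups[3])  # facebook
print(groups[6])  # /user_2033
```

One side effect worth knowing: with the `*` moved inside, the path group is no longer optional, so a URL without a trailing path such as 'https://example.com' will not match. If that matters, appending `?` after the path group's closing parenthesis should restore the old behaviour.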

BTW, you can use urllib.parse.urlparse (Python 3) / urlparse.urlparse (Python 2) instead of a regular expression:

>>> import urllib.parse
>>> urllib.parse.urlparse('https://www.facebook.com/user_2033')
ParseResult(scheme='https', netloc='www.facebook.com',
            path='/user_2033', params='', query='', fragment='')
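
If the goal is just to detect simple http(s) URLs, the parsed result can be checked directly. A minimal sketch for Python 3; the helper name is_simple_http_url and its acceptance rule (http or https scheme plus a non-empty host) are assumptions for illustration, not part of the standard library:

```python
from urllib.parse import urlparse


def is_simple_http_url(candidate):
    """Return True for strings with an http(s) scheme and a host part.

    Hypothetical helper; the name and the acceptance rule are assumptions.
    """
    parts = urlparse(candidate)
    return parts.scheme in ('http', 'https') and bool(parts.netloc)


parts = urlparse('https://www.facebook.com/user_2033')
print(parts.hostname)  # www.facebook.com
print(parts.path)      # /user_2033
print(is_simple_http_url('https://www.facebook.com/user_2033'))  # True
print(is_simple_http_url('facebook.com/user_2033'))              # False
```

Note that urlparse splits a string rather than validating it, so this check is deliberately loose.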

edited Jul 2, 2017 at 2:06

answered Jul 2, 2017 at 2:03


2 Comments

falsetru Over a year ago

@ITJ, I added an explanation at the top of the answer.

falsetru Over a year ago

@ITJ, I just added an alternative way to parse the URL (using urllib.parse.urlparse).

mark kats · Answer · 2021-07-14 11:58:51Z

0

I've used the following regex for simple URL validation:

((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*
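
A minimal sketch of how this pattern might be used from Python. The wrapper name looks_like_url and the choice of re.fullmatch (so the whole string has to match) are my assumptions:

```python
import re

# The pattern from this answer, compiled unchanged.
simple_url = re.compile(
    r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}'
    r'([a-zA-Z0-9\.\&\/\?\:@\-_=#])*'
)


def looks_like_url(candidate):
    """Hypothetical wrapper: True if the whole string matches the pattern."""
    return simple_url.fullmatch(candidate) is not None


print(looks_like_url('https://www.facebook.com/user_2033'))  # True
print(looks_like_url('not a url'))                           # False
# ':' and '/' are in the character class, so other schemes pass too.
print(looks_like_url('ftp://example.com'))                   # True
```

The pattern is permissive: the scheme is optional, and because ':' and '/' belong to the first character class, strings with other schemes are accepted as well. It is a check that something looks URL-like, not a way to extract parts.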

answered Jul 14, 2021 at 11:58
