Python - Regex matching urls in page source code

Question

I use this pattern to match every url in a given webpage:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font></a>
"""

urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', source)

This has worked pretty well for me until now, but I found that it sometimes doesn't match the exact URL. In the example, it matches https://example.com</p> and https://example.com</font></a>, including the closing tags, and I can't figure out what the problem in the regex is. I took this code from another Stack Overflow question.

You use a hyphen inside a character class between two symbols, [$-_], which creates a range that can match < and >, all ASCII digits and uppercase letters, and more. Replace [$-_@.&+] with [-$_@.&+].
– Wiktor Stribiżew, Commented Feb 9, 2017 at 8:55
You can also check this: stackoverflow.com/questions/6883049/…
– bob marti, Commented Feb 9, 2017 at 8:57
URLs should be inside <a></a> tags. Do you have some special input?
– Ika8, Commented Feb 9, 2017 at 8:59
@WiktorStribiżew This matches only the base URL; for example, example.com/1 would match only example.com.
– Hyperion, Commented Feb 9, 2017 at 8:59
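
The comments above can be checked with a small sketch. It compares the original character class with the one that has the hyphen moved to the front. The `/` added to the fixed class is an assumption, not part of the suggested replacement: in the original pattern `/` was only matched because it falls inside the `$-_` range, which would explain why paths are lost once the range is removed. The sample input and the names `buggy` and `fixed` are illustrative.

```python
import re

source = """
<p>https://example.com</p>
<font color="E80000">https://example.com/1</font></a>
"""

# Original class: "$-_" is a range from "$" (0x24) to "_" (0x5F),
# which covers "<", ">", "/", the digits and the uppercase letters.
buggy = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hyphen moved to the front so it is a literal character.
# "/" is added explicitly (assumption) so that paths are still matched.
fixed = r'http[s]?://(?:[a-zA-Z]|[0-9]|[-$_@.&+/]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

buggy_urls = re.findall(buggy, source)
fixed_urls = re.findall(fixed, source)

print(buggy_urls)  # closing tags are swallowed into the match
print(fixed_urls)  # matches stop at "<"
```

With this input, the fixed pattern stops at the first `<`, while the original one runs on through the closing tags.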

Arun · Accepted Answer · 2017-02-09 09:24:02Z

Try this:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font>
https://example.com</p></a>
https://example.com</font></a>
"""
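# Raw string so the backslashes reach the regex engine unchanged.
# The pattern has capture groups, so findall returns (scheme, host, path) tuples.
urls = re.findall(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', source)
print(urls)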
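
Because the pattern above contains capture groups, re.findall returns one tuple of groups per match rather than the full URL. If whole URLs are wanted, one possible variation (a sketch, not part of the original answer) is to make every group non-capturing. The sample input and the name `pattern` are illustrative.

```python
import re

source = """
<p>https://example.com</p>
<font color="E80000">https://example.com/path?q=1</font>
"""

# Same structure as the answer's pattern, but with (?:...) groups only,
# so findall returns the complete matched text of each URL.
pattern = re.compile(
    r'(?:http|ftp|https)://'                     # scheme
    r'[\w-]+(?:\.[\w-]+)+'                       # host with at least one dot
    r'(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?'   # optional path and query
)

urls = pattern.findall(source)
print(urls)
```

Neither `<` nor `>` appears in any of the character classes, so each match ends before the closing tag.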
