0

I use this pattern to match every url in a given webpage:

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font></a>
"""

urls = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', source)

This has worked for me pretty well until now. I found that sometimes it doesn't match the exact url. Like in the example it match as url https://example.com</p> and https://example.com</font></a> inlcuding the closing tags but I can't figure out what is the problem in the regex. I took this code from another stack question.

8
  • 1
    You use a hyphen inside a character class between two symbols, [$-_], that creates a range that can match < and >, and all ASCII digits and uppercase letters, and more. Replace [$-_@.&+] with [-$_@.&+]. Commented Feb 9, 2017 at 8:55
  • see this link stackoverflow.com/questions/499345/… Commented Feb 9, 2017 at 8:56
  • u can also check this stackoverflow.com/questions/6883049/… Commented Feb 9, 2017 at 8:57
  • URLs should be in <a></a> quotes.. Do you have an special input or something? Commented Feb 9, 2017 at 8:59
  • @WiktorStribiżew This matches only the base url, like example.com/1 would match only example.com Commented Feb 9, 2017 at 8:59

1 Answer 1

1

try this,

import re

source = """
<p>https://example.com</p>
... some code
<font color="E80000">https://example.com</font>
https://example.com</p></a>
https://example.com</font></a>
"""
urls = re.findall('(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', source)
print urls
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.