Extracting URL link using regular expression re - string matching - Python

Question

I've been trying to extract URLs from a text file using re api. any link that starts with http:// , https:// and www.

the file contains texts as well as html source code, html part is easy because i can extract them using BeautifulSoup, but normal text seems to be more challenging. I found this online which seems to be the best implementation of URL extraction however it fails on certain tags, specially it can't handle tags and includes them in the URL. any help is appreciated, because I'm not familiar with string matching at all myself

here is the signature

sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))
sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))

examples:

http://www.website.com/science/</span></a><o:p></o:p></span></div><div
www.website.com/library/</span></a></span></i><span
http://awebsite.com/Groups</a><div>

Joel Cornett · Accepted Answer · 2014-01-18 05:46:03Z

10

re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))

The [^\s<>"]+ part matches any non-whitespace, non quote, non anglebracket character to avoid matching strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>

edited Jan 18, 2014 at 5:46

answered May 6, 2012 at 23:49

Joel Cornett

24.8k9 gold badges69 silver badges90 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Extracting URL link using regular expression re - string matching - Python

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related