Using Python regex to extract clean URLs

Question

Thanks! I used @nu11p01n73R's answer from this post and got mostly the URLs, but there is still some extra "noise" at the beginning and end. Ideally I'd like it to print just the URL (http://something.some), so the regex would remove the <a href=" at the beginning of the URL and the " data-metrics='{"action" : "Click Story 2"}'> at the end. I tried modifying the expression to do that, but the URL begins and ends with a ", and I think that is messing up my regex. Any suggestions?

URLs are embedded like this in .txt file:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

I'd love the output to be:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war

Most recent code I used was:

file  = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
for line in file:
    if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
        print line

But this returns, for example:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >

@AvinashRaj - nothing wrong with beautiful soup (it's beautiful), just trying to use regex because I need to get more comfortable with them and this helps with that.
– shannimcg, Commented Nov 19, 2014 at 17:35

Avinash Raj · Accepted Answer · 2014-11-19 17:55:53Z

1

Regex is not the right tool for parsing HTML files. Since you specifically want to use regex, though, here is a solution.

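import re

# Print only the URL from each line whose href contains one of the keywords.
with open("/Users/shannonmcgregor/Desktop/npr.txt", 'r') as file:
    for i in file:
        if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
            # Replace the whole line with the captured href value.
            i = re.sub(r'^.*?<a href="([^"]*)".*', r'\1', i)
            # The line's trailing newline survives the substitution, so strip it.
            print(i.rstrip())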

OR

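with open("/Users/shannonmcgregor/Desktop/npr.txt", 'r') as file:  # reopen: the first loop consumed the file
    for i in file:
        if re.search('<a href="[^>"]*(islamic|praying|marines|comets|dyslexics)', i):
            # group(1) is the href value captured by the parentheses.
            print(re.search(r'^.*?<a href="([^"]*)".*', i).group(1))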
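The filter-then-extract steps above can also be folded into a single compiled pattern. The following is only a sketch under assumptions: the helper name extract_urls, its parameters, and the second sample line (an example.com URL) are made up for illustration and do not come from the answer. It works on an in-memory list of lines, so it can be tried without the .txt file.

```python
import re

# Hypothetical helper; the name and signature are illustrative, not from the answer.
def extract_urls(lines, keywords):
    """Yield the href value of each line whose link URL contains a keyword."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # The lookahead requires a keyword somewhere inside the quoted href value;
    # group 1 then captures everything between the two double quotes.
    pattern = re.compile(r'<a href="(?=[^"]*(?:%s))([^"]*)"' % alternation)
    for line in lines:
        match = pattern.search(line)
        if match:
            yield match.group(1)

sample = [
    # First line is the example from the question.
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    'data-metrics=\'{"action":"Click Story 1"}\' >\n',
    # Second line is invented to show a link that should be skipped.
    '<a href="http://example.com/other-story" '
    'data-metrics=\'{"action":"Click Story 2"}\' >\n',
]

keywords = ["islamic", "praying", "marines", "comets", "dyslexics"]
for url in extract_urls(sample, keywords):
    print(url)
```

Because the keyword check lives inside the href value, a keyword that appears only in another attribute of the same tag would not count as a match here, which may or may not be what you want.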

edited Nov 19, 2014 at 17:55

answered Nov 19, 2014 at 17:50

Avinash Raj


2 Comments

AlexWei Over a year ago

Why not just use group instead of sub?

AlexWei Over a year ago

Yes. That is what I mean.

nu11p01n73R · Accepted Answer · 2014-11-19 17:48:04Z

0

You can use the re.findall function to extract the content:

import re

with open("/Users/shannonmcgregor/Desktop/npr.txt", 'r') as file:
    for line in file:
        if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
            # The first double-quoted span on the line is the href value.
            print(re.findall(r'(?<=")[^"]*(?=")', line)[0])

This produces the following output:

http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
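It may help to see what the lookaround pattern actually returns before taking element [0]. The sketch below runs it on the sample line from the question; the variable names are illustrative. The pattern matches every run of text that sits between two double quotes, so the URL comes first only because href is the first quoted attribute on the line. A pattern anchored on href= with a capturing group is shown next to it for comparison.

```python
import re

# Sample line copied from the question.
line = ('<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
        'how-the-islamic-state-wages-its-propaganda-war" '
        'data-metrics=\'{"action":"Click Story 1"}\' >')

# Matches every stretch of text preceded and followed by a double quote,
# including the pieces inside the data-metrics attribute.
between_quotes = re.findall(r'(?<=")[^"]*(?=")', line)

# Anchoring on href= and capturing the quoted value returns only the link target.
href_values = re.findall(r'<a href="([^"]*)"', line)

print(between_quotes)
print(href_values)
```

If a tag ever had another double-quoted attribute before href, between_quotes[0] would no longer be the URL, while the anchored pattern would still pick out the href value.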

answered Nov 19, 2014 at 17:48

nu11p01n73R

