Thanks! I used @nu11p01n73R's answer from this post and got mostly the URLs, but there is still some extra "noise" at the beginning and end. Ideally I'd like it to print just the URL (http://something.some), so the regex would strip the <a href=" at the beginning of the URL and the " data-metrics='{"action" : "Click Story 2"}'> at the end. I tried modifying the expression to do that, but I'm having trouble because the URL begins and ends with a ", which I think is messing up my regex. Any suggestions?
URLs are embedded like this in the .txt file:
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >
I'd love the output to be:
http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
Most recent code I used was:
import re

file = open("/Users/shannonmcgregor/Desktop/npr.txt", 'r')
for line in file:
    # Print every line whose <a href=...> tag contains one of the keywords
    if re.search('<a href=[^>]*(islamic|praying|marines|comets|dyslexics)', line):
        print(line)
But this returns, for example:
<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war" data-metrics='{"action":"Click Story 1"}' >
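To make the goal concrete, here is a minimal sketch of one possible approach, not a definitive fix: wrap the part between the double quotes in a capture group and print only that group instead of the whole line. The names `URL_PATTERN` and `extract_urls` are hypothetical, and the sketch assumes every link uses double quotes around the `href` value, as in the sample line above.

```python
import re

# Hypothetical pattern: capture the text between the double quotes after
# href=, but only when that text contains one of the keywords.
# [^"]* cannot cross a double quote, so the match stays inside the URL.
URL_PATTERN = re.compile(
    r'<a href="([^"]*(?:islamic|praying|marines|comets|dyslexics)[^"]*)"'
)


def extract_urls(lines):
    """Return the captured URL from each line that matches the pattern."""
    urls = []
    for line in lines:
        match = URL_PATTERN.search(line)
        if match:
            # group(1) is only the URL, without <a href=" or the attributes
            urls.append(match.group(1))
    return urls


# Sample line copied from the .txt file shown above
sample = [
    '<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/'
    'how-the-islamic-state-wages-its-propaganda-war" '
    'data-metrics=\'{"action":"Click Story 1"}\' >',
]

for url in extract_urls(sample):
    print(url)
```

One difference from the code above: `[^>]*` lets a keyword match anywhere inside the tag, including the `data-metrics` attribute, while this sketch only accepts keywords inside the quoted URL itself. To run it on the real file, the open file object could be passed in place of `sample`.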