Using Python regex to extract certain URLs from text

Question

So I have the HTML from an NPR page, and I want to use regex to extract only certain URLs (the links to specific stories nested within the page). The actual links appear in the text (retrieved manually) as:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

Obviously, I cannot continue to retrieve these manually if I want to use this on a consistent basis. So far, I have this code:

import nltk
import re

f = open("/Users/shannonmcgregor/Desktop/npr.txt")
npr_lines = f.readlines()
f.close()

I have this code to grab everything between (

for line in npr_lines:
    re.findall('<a href="?\'?([^"\'>]*)', line)

But that grabs all URLs. I tried adding something like:

(parallels|thetwo-way|a-marines)

but that returns nothing. So what am I doing wrong? How do I combine the larger URL stripper with these specific words that target the given URLs?

Please and thank you :)

Use an HTML parser instead: crummy.com/software/BeautifulSoup
– Benjamin Toueg, Commented Nov 19, 2014 at 9:02
Could you post the contents of the /Users/shannonmcgregor/Desktop/npr.txt file along with the expected output?
– Avinash Raj, Commented Nov 19, 2014 at 9:14

Avinash Raj · Accepted Answer · 2014-11-19 16:58:16Z

2

Use a tool designed specifically for parsing HTML and XML files, such as BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> s = """<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">"""
>>> soup = BeautifulSoup(s, 'html.parser')  # or pass the file directly: BeautifulSoup(open('/Users/shannonmcgregor/Desktop/npr.txt'), 'html.parser')
>>> atag = soup.find_all('a')
>>> links = [i['href'] for i in atag]
>>> import re
>>> for i in links:
...     if re.match(r'.*(parallels|thetwo-way|a-marines).*', i):
...         print(i)
...


http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament
http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear
http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice
http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help

edited Nov 19, 2014 at 16:58

answered Nov 19, 2014 at 9:29

Avinash Raj


1 Comment

Eric Over a year ago

re.match(r'.*(parallels|thetwo-way|a-marines).*', i) is better spelt re.search(r'parallels|thetwo-way|a-marines', i) in this case
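The difference Eric points out can be checked with a short sketch. `re.match` only matches at the start of the string, which is why the answer's pattern needs the surrounding `.*`, while `re.search` scans the whole string. The first two URLs below come from the question; the third URL and the names `KEYWORDS`, `via_match`, `via_search`, and `bare_match` are made up for illustration.

```python
import re

links = [
    "http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war",
    "http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear",
    "http://www.npr.org/sections/news/",  # hypothetical non-matching URL, added for contrast
]

# Hypothetical name for the alternation used in the answer.
KEYWORDS = r'parallels|thetwo-way|a-marines'

# re.match anchors at the start, so it needs .* on both sides of the keywords.
via_match = [u for u in links if re.match(r'.*(' + KEYWORDS + r').*', u)]

# re.search looks anywhere in the string, so the bare alternation is enough.
via_search = [u for u in links if re.search(KEYWORDS, u)]

# re.match with the bare alternation finds nothing: no URL starts with a keyword.
bare_match = [u for u in links if re.match(KEYWORDS, u)]

print(via_match == via_search)
print(len(via_search), len(bare_match))
```

Both filters keep the same two URLs here, so the shorter `re.search` form is the easier one to read and maintain.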

nu11p01n73R · Accepted Answer · 2014-11-19 09:20:11Z

You can use the re.search function to test each line against the regex and print the line if it matches:

>>> import re
>>> with open('/Users/shannonmcgregor/Desktop/npr.txt', 'r') as f:
...     for line in f:
...         if re.search(r'<a href=[^>]*(parallels|thetwo-way|a-marines)', line):
...             print(line, end='')  # each line already ends with a newline
...

This gives the following output:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">
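If only the URL is wanted rather than the whole line, one option is to wrap a capture group around the `href` value. The following is a minimal sketch and not part of the answer above: the `html` string stands in for the file contents, its last link is made up for contrast, and `pattern` and `urls` are illustrative names. It also assumes the `href` values are double-quoted, as in the question's sample.

```python
import re

# Sample text standing in for the contents of npr.txt; the last link is hypothetical.
html = '''<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/about/">'''

# Group 1 captures the double-quoted href value. The non-capturing group
# (?:...) requires one of the keywords to appear somewhere inside that value.
pattern = re.compile(r'<a href="([^">]*(?:parallels|thetwo-way|a-marines)[^">]*)"')

# With exactly one capture group, findall returns a list of the captured strings.
urls = pattern.findall(html)
for url in urls:
    print(url)
```

Because `[^">]*` cannot cross a quote or a closing bracket, a keyword in one tag cannot cause a neighbouring tag's URL to be picked up.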

asontu · Accepted Answer · 2014-11-19 09:15:20Z

0

You can do this by using a lookahead:

<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)
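As a rough illustration of how this pattern plugs into the `re.findall` call from the question, the sketch below runs it over a few sample lines. The pattern is copied verbatim into a raw triple-quoted string; the `lines` list is illustrative, and its last entry is a made-up link that should be skipped. The names `pattern`, `lines`, and `matches` are assumptions for the example.

```python
import re

# The lookahead pattern from above, copied verbatim. The (?=...) part checks that
# a keyword appears before the next quote or '>', without consuming any text.
pattern = re.compile(r'''<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)''')

lines = [
    '<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">',
    '<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">',
    '<a href="http://www.npr.org/about/">',  # hypothetical link without any keyword
]

matches = []
for line in lines:
    # Only group 1 is a capture group, so findall yields the URL strings directly.
    matches.extend(pattern.findall(line))

for url in matches:
    print(url)
```

The lookahead keeps the original "grab the href value" structure intact, so the same capture group that returned every URL now returns only the ones containing a keyword.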


answered Nov 19, 2014 at 9:15

asontu
