0

So I have the HTML from an NPR page, and I want to use regex to extract just certain URLs for me (these call the URLs to specific stories nested within the page). The actual links appear in the text (retrieved manually) as:

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

obviously, I cannot to continue to use manual retrieval if I want to be able to use this on a consistent basis. So far, I have this code:

import nltk
import re

f = open("/Users/shannonmcgregor/Desktop/npr.txt")
npr_lines = f.readlines()
f.close()

I have this code to grab everything between (

for line in npr_lines:
re.findall('<a href="?\'?([^"\'>]*)', line)

But that grabs all urls. I tried adding something like:

(parallels|thetwo-way|a-marines)

but that returns nothing. So what am I doing wrong? How I combine the larger URL stripper with these specific words that target the given URLs?

Please and thank you :)

3
  • What's your input and expected output? Commented Nov 19, 2014 at 9:02
  • 2
    Use an HTML parser instead, crummy.com/software/BeautifulSoup Commented Nov 19, 2014 at 9:02
  • could you post the contents of /Users/shannonmcgregor/Desktop/npr.txt file along with the expected output? Commented Nov 19, 2014 at 9:14

3 Answers 3

2

Through a tool which is specially designed for parsing html and xml files [BeautifulSoup],

>>> from bs4 import BeautifulSoup
>>> s = """<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">"""
>>> soup = BeautifulSoup(s) # or pass the file directly into BS like >>> soup = BeautifulSoup(open('/Users/shannonmcgregor/Desktop/npr.txt'))
>>> atag = soup.find_all('a')
>>> links = [i['href'] for i in atag]
>>> import re
>>> for i in links:
        if re.match(r'.*(parallels|thetwo-way|a-marines).*', i):
            print(i)


http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war
http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament
http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear
http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice
http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help
Sign up to request clarification or add additional context in comments.

1 Comment

re.match(r'.*(parallels|thetwo-way|a-marines).*', i) is better spelt re.search(r'parallels|thetwo-way|a-marines', i) in this case
0

You can use re.search function to match the regex in the line and prints the line if it matches as

>>> file  = open('/Users/shannonmcgregor/Desktop/npr.txt', 'r')
>>> for line in file:
...     if re.search('<a href=[^>]*(parallels|thetwo-way|a-marines)', line):
...             print line

will give an output as

<a href="http://www.npr.org/blogs/parallels/2014/11/11/363018388/how-the-islamic-state-wages-its-propaganda-war">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363309020/asked-to-stop-praying-alaska-school-won-t-host-state-tournament">
<a href="http://www.npr.org/2014/11/11/362817642/a-marines-parents-story-their-memories-that-you-should-hear">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363288744/comets-rugged-landscape-makes-landing-a-roll-of-the-dice">
<a href="http://www.npr.org/blogs/thetwo-way/2014/11/11/363293514/for-dyslexics-a-font-and-a-dictionary-that-are-meant-to-help">

Comments

0

You can do this by using a lookahead:

<a href="?\'?((?=[^"\'>]*(?:thetwo\-way|parallels|a\-marines))[^"\'>]+)

Regular expression visualization

Debuggex Demo

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.