I'm writing a script that should open 10 text files in turn (each one is the source code of a different webpage). I then want the script to go through and replace every instance of <br /> with \n, and then delete the whole header. The document always starts with DOCTYPE, and the last line before the information that I want ends with:
"decoration:underline">no year</span><br />
As far as I'm aware, the s at the end of the regex /.../s means that . should also match line breaks, and I've escaped the / that appears in the </span> tag.
So far, I have the following:
import re
def create_linebreaks(l):
    l = l.replace('<br />', r'\n')
    return l
def clean_up(line):
    line = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', line)
    return line
data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br /> <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " /> <br /> <br /> <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " /> <br /> <br />"""
create_linebreaks(data)
clean_up(data)
print data
raw_input()
All I get out, however, is the same string, unchanged.
Desired output is something like:
""" <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b> <i style="font-size:small"> 3.5 stars, 1hr 24m <a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål - English " />
<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b> <i style="font-size:small"> 3.7 stars, 1hr 28m <a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i> <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " alt="Closed Captions: --- - Danish - Swedish - Finnish - Norwegian Bokmål " /> """
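For comparison, here is a minimal sketch of a variant that seems to get close to that output. It rests on a few assumptions that are not confirmed above:

- Python's re module does not use /.../s delimiters; the "dot matches newline" behaviour is requested with the re.DOTALL flag instead.
- The return values of both functions have to be kept, since strings are immutable.
- '\n' (rather than r'\n') is what produces a real line break.

The data string is a shortened, hypothetical stand-in for the real page source, and the header is assumed to end at the first no year</span><br />.

```python
import re

# Everything from the DOCTYPE up to and including the first
# 'no year</span><br />'. re.DOTALL lets '.' match newlines as well, and
# '.+?' is non-greedy so the match stops at the first marker it finds.
HEADER = re.compile(r'^<!DOCTYPE.+?no year</span><br />', re.DOTALL)


def clean_up(text):
    # Remove the header once; the rest of the text is left untouched.
    return HEADER.sub('', text, count=1)


def create_linebreaks(text):
    # '\n' is a real newline; r'\n' would insert a backslash followed by 'n'.
    return text.replace('<br />', '\n')


# Shortened stand-in for the real page source (hypothetical sample data).
data = """<!DOCTYPE html><html>
A LOAD OF OTHER HTML
<span style="text-decoration:underline">no year</span><br /> <b>Beautiful Game, The</b> <br /> <br /> <b>Brave Miss World</b> <br /> <br />"""

# Strings are immutable, so the returned values must be kept.
# The header is stripped first, while its closing '<br />' is still present.
result = create_linebreaks(clean_up(data))
print(result)
```

The order of the two calls matters in this sketch: if the <br /> tags were replaced first, the header pattern would no longer find its closing marker.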