Regex with multiple lines and HTML tags

Question

I'm writing a script that should open 10 text files in turn (they are source codes from different webpages). I then want the script to go through and replace any instances of <br /> with \n. I then want it to delete the whole header, essentially. In any case, the document always starts with DOCTYPE and the last line before the information that I want ends

"decoration:underline">no year</span><br />

As far as I'm aware, the regex /.../s means 'ignore line breaks', and I've escaped the HTML / that appears in the </span> tag. So far, I have the following

import re
def create_linebreaks(l):
    l = l.replace('<br />', r'\n')
    return l
def clean_up(line):
    line = re.sub(r'/^<!DOCTYPE.+no year<\/span>/s', '', line)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

create_linebreaks(data)
clean_up(data)
print data
raw_input()

All I get out, however is the same string.

Desired output is something like:

"""  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

<b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  """

Parsing HTML with regular expressions is unwise - have you considered using an actual parser (BeautifulSoup, for example)? — jonrsharpe
– jonrsharpe, Commented Sep 17, 2014 at 13:03
if you just want to replace a tag, use str.replace("<br/>", "whateveryouwant); — user1767754
– user1767754, Commented Sep 17, 2014 at 13:05
At this stage, I don't really consider it parsing. I literally just want to delete all the text between the start of a document and a keyword. I'm literally just trying to stop having to go into Notepad++ as doing it by hand. I have already worked out how to extract the information I want from what's left and this is the missing step. — ACrazyChemist
– ACrazyChemist, Commented Sep 17, 2014 at 13:07
When you print the variable data, why would you expect it to change? You're passing that data through your function and returning it as l? Seems like you're wanting to print clean_data(data) instead..... If you want to actually change the value of data, you'd need to do so in your function. (remember the global scope as well) — mnjeremiah
– mnjeremiah, Commented Sep 17, 2014 at 13:13

mhawke · Accepted Answer · 2014-09-17 14:02:44Z

The main problem is that your regex pattern is wrong for Python.

In r'/^<!DOCTYPE.+no year<\/span>/s', the leading / and trailing /s are considered to be part of the pattern, not modifiers of its behaviour. This looks like PCRE regex syntax a la PHP, and it is not supported in Python. Instead, to get . to match any character including newline, you need to set the re.DOTALL flag as shown below.

The other problem is that the return value from create_linebreaks() and clean_up() is not assigned back to data, so the changes are lost.

Also, you don't want a raw string for the newline character in create_linebreaks(), a normal string is fine (otherwise you would replace <br /> with \\n).

import re

def create_linebreaks(l):
    l = l.replace('<br />', '\n')
    return l

def clean_up(line):
    line = re.sub(r'^<!DOCTYPE.+no year<\/span>', '', line, flags=re.DOTALL)
    return line

data = """<!DOCTYPE html><html class='v2' dir='ltr' xmlns='http://www.w3.org/1999/xhtml' xmlns:b='http://www.google.com/2005/gml/b' movie/file/show/episodes is 2763.</p>A LOAD OF OTHER HTML I DON'T WANT TO BE IN THE OUTPUT
<!-- google_ad_section_start(weight=ignore) --><span class="listings"><span style="font-size:large;font-weight:bold; text-decoration:underline">no year</span><br />  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  <br />  <br />  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  <br />  <br />"""

data = create_linebreaks(data)
data = clean_up(data)

>>> print data

  <b><a target="_blank" href="http://movies.netflixable.com/224599">Beautiful Game, The</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.5 stars, 1hr 24m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=The Beautiful Game">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  - English  " />  

  <b><a target="_blank" href="http://movies.netflixable.com/224278">Brave Miss World</a> (no year)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.7 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=Brave Miss World">imdb</a></i>  <img class="cc_img" src="http://bit.ly/VqRKtD" border="0" style="padding:0px !important;" title="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " alt="Closed Captions: ---  - Danish  - Swedish  - Finnish  - Norwegian Bokm&#65533;&#65533;l  " />  


>>>

I have copy-pasted your code and when I ran it, it hasn't given the same printout that you have described above. I'm working of 2.7.8, could that be a factor? EDIT: I don't know what I changed but I fixed it!

Collectives™ on Stack Overflow

Regex with multiple lines and HTML tags

1 Answer 1

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related