python regex problem

Question

s = re.sub(r"<style.*?</style>", "", s)

Isn't this code supposed to remove styles in the s string? Why does it not work? I am trying to remove the following code:

<style type="text/css">
body { ... }
</style>

Any suggestion?

Every time I see regex parsing HTML, I remember this question: RegEx match open tags except XHTML self-contained tags
– Utku Zihnioglu, Commented Aug 11, 2011 at 23:28

eyquem · Accepted Answer · 2011-08-11 23:24:08Z

6

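No, it's the re.DOTALL flag that is needed! Without it, '.' does not match a newline, so the pattern cannot span the lines between <style and </style>.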

re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.

http://docs.python.org/library/re.html#re.DOTALL
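As a minimal sketch of the fix applied to the pattern from the question (the sample string and the names without_flag and with_flag are made up for illustration):

```python
import re

# Hypothetical sample input: the <style> block spans several lines.
s = '''<p>before</p>
<style type="text/css">
body { ... }
</style>
<p>after</p>'''

# Without re.DOTALL, '.' stops at the first newline, so nothing matches.
without_flag = re.sub(r"<style.*?</style>", "", s)

# With re.DOTALL, '.' also matches newlines and the whole block is removed.
# flags must be passed by keyword: the fourth positional argument is count.
with_flag = re.sub(r"<style.*?</style>", "", s, flags=re.DOTALL)

print(without_flag == s)
print(with_flag)
```

Note that only the style block itself is removed; the newlines that surrounded it stay in the string.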

Edit

In some cases, you may need the dot to match every character (newlines included) in one region of a string, but only non-newline characters in another region. The re.DOTALL flag doesn't allow this, because it applies to the whole pattern.

In this case, it's useful to know the following trick: use [\s\S] to match any character, including a newline.

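import re

s = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY'''
print(s, '\n')

# [\s\S] matches any character, newlines included; the plain '.' in the
# second alternative still stops at the end of the line.
regx = re.compile(r"<style[\s\S]*?</style>|(?<=ro)mizu.+")

s = regx.sub('AAA', s)
print(s)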

result

alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY 

alhambra
AAA
toroAAA
YYYYYYYYYYYYYY
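Another way to get the same per-region behaviour, not part of the original answer and assuming Python 3.6 or later, is a scoped inline flag: (?s:...) turns on DOTALL only inside that group. A small sketch using the same sample string (the name result is just for illustration):

```python
import re

s = '''alhambra
<style type="text/css">
body { ... }
</style>
toromizuXXXXXXXX
YYYYYYYYYYYYYY'''

# (?s:...) enables DOTALL only inside the group, so '.' crosses newlines
# within the <style> block, while the '.' after 'mizu' stops at end of line.
regx = re.compile(r"(?s:<style.*?</style>)|(?<=ro)mizu.+")

result = regx.sub('AAA', s)
print(result)
```

This should produce the same output as the [\s\S] version above; which one reads better is largely a matter of taste.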

edited Aug 11, 2011 at 23:24

answered Aug 11, 2011 at 23:04

eyquem


1 Comment

Shaokan Over a year ago

Yes correct, I just came back to say that I've found the solution but here you are! Good answer!
