
I'm not very good with regex and I'm looking for the syntax to exclude something. I'm escaping <, >, " and & in HTML code (replacing them with &lt;, etc.) and I need to exclude <br/> from the replacement. For example:

<html><br/>
   <head><title></title></head><br/>
   <body><br/>
   </body><br/>
</html>

I tried something like r'<\b?![br]' and other variations, but they don't work completely. I use re.sub() to do the replacement.

  • 2
    @stdio you don't need external libraries; Python comes with the excellent ElementTree (an API which lxml provides an even better implementation of) out of the box. Commented Sep 4, 2011 at 19:11
  • 1
    XML (like SGML, which it extends) is not a regular language (in the computer science meaning of the term -- if you've taken a compiler design class, they should go into it). Regular expressions are not powerful enough to parse it. Commented Sep 4, 2011 at 19:13
  • 2
    @Charles Most modern regular expression implementations (including Python's) aren't truly regular. Also, closing this question as a duplicate of that joke post helps the OP in no way. Commented Sep 4, 2011 at 19:17
  • 4
    This was erroneously closed as an exact duplicate OF A JOKE ANSWER!!! How much more stupid and lame (and wrong) can you possibly get? Voting to reopen. The guy deserves to have his question answered. This BURN THE WITCH attitude around here is absolutely too damned much! Commented Sep 4, 2011 at 19:24
  • 2
    Unless I'm missing something, and provided it's just <br/> (not any variants), you can just replace <(?!br/>) with &lt; and (?<!<br/)> with &gt; and that's it? Commented Sep 4, 2011 at 20:04

3 Answers

3

OK, now that the question is open again, I can post this as an answer, so...

Unless I'm missing something, and provided it's just <br/> (not any variants), you can just replace <(?!br/>) with &lt; and (?<!<br/)> with &gt; and that's it?


In Python, it looks like that means this:

import re

# Raw strings keep the regex readable; text is the string being escaped.
text = re.sub(r'<(?!br/>)', '&lt;', text)
text = re.sub(r'(?<!<br/)>', '&gt;', text)


To explain what's going on, (?!...) is a negative lookahead - it only successfully matches at a position if the following text does not match the sub-expression it contains.
(Note lookaheads do not consume the text matched by their sub-expression, they only verify if it exists, or not.)

Similarly, (?<!...) is a negative lookbehind, and does the same thing but using the preceding text.
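
As a minimal sketch of how the two substitutions fit into the full escaping task from the question: the function name escape_except_br is made up for illustration, and it assumes the only tag to preserve is the exact string <br/>. The & is replaced first so that the & of the entities added afterwards is not escaped a second time.

```python
import re

def escape_except_br(text):
    """Escape &, ", < and > but leave the literal tag <br/> untouched."""
    # Escape & first; otherwise the & in &lt;, &gt; and &quot; would be escaped too.
    text = text.replace('&', '&amp;')
    text = text.replace('"', '&quot;')
    # '<' not followed by 'br/>': the negative lookahead consumes only the '<'.
    text = re.sub(r'<(?!br/>)', '&lt;', text)
    # '>' not preceded by '<br/': the negative lookbehind consumes only the '>'.
    text = re.sub(r'(?<!<br/)>', '&gt;', text)
    return text

print(escape_except_br('<body>"a & b"</body><br/>'))
# prints: &lt;body&gt;&quot;a &amp; b&quot;&lt;/body&gt;<br/>
```

Because neither lookaround consumes the br/ text, only the single bracket character is replaced each time.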

However, lookbehinds differ slightly from lookaheads (in some regex implementations): the sub-expressions inside lookbehinds must represent fixed-width or limited-width matches.

Python is one of the implementations that requires a fixed width. The expression above works because <br/ is always four characters, but (?<!<br\s*/?)> would not be a valid regex in Python because it represents a variable-length match. (However, you can stack multiple lookbehinds, so you could enumerate the assorted options manually if that was necessary.)
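
A rough sketch of the stacking idea, assuming the spellings to keep are exactly <br>, <br/> and <br /> (that list, and the names KEEP_LT, KEEP_GT and escape_angle_brackets, are assumptions for illustration). It also checks that the standard re module rejects the variable-width lookbehind when compiling it.

```python
import re

# A variable-width lookbehind is rejected by Python's re module at compile time.
try:
    re.compile(r'(?<!<br\s*/?)>')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Lookaheads have no width restriction, so one pattern covers all three spellings.
KEEP_LT = re.compile(r'<(?!br(?: ?/)?>)')
# Stacked fixed-width lookbehinds: one per spelling that should be kept.
KEEP_GT = re.compile(r'(?<!<br)(?<!<br/)(?<!<br /)>')

def escape_angle_brackets(text):
    """Escape < and > except where they belong to <br>, <br/> or <br />."""
    text = KEEP_LT.sub('&lt;', text)
    return KEEP_GT.sub('&gt;', text)

print(variable_width_ok)
print(escape_angle_brackets('<p><br><br/><br /></p>'))
# prints: &lt;p&gt;<br><br/><br />&lt;/p&gt;
```

Each stacked lookbehind is tested at the same position, so the > is replaced only if none of the listed spellings precedes it.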


2 Comments

I already said: perfect ;) Now, is there a way to do it all in one step? For the regex it's no problem, I can use the 'or' operator (|), but is there a way to pass multiple values to re.sub() as the second parameter?
You're replacing with different things, so you can't really do it in one step. Well, I think PHP lets you pass in an array (for both regex and replacement), but this isn't mentioned in the Python docs, so it would need to be a user-defined function if it's that important. Of course, you can probably also do re.sub( '<(?!br/>)' , '&lt;' , re.sub( '(?<!<br/)>' , '&gt;' , text ) ) if it's just a case of not wanting a temporary variable.
0

Replace everything, then in a second pass replace "&lt;br/&gt;" with "<br/>".

Or, to generalize, have a list of tags you want to 'revert' and replace "&lt;tag&gt;" with "<tag>", "&lt;/tag&gt;" with "</tag>" and "&lt;tag/&gt;" with "<tag/>".
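
A minimal sketch of this two-pass idea, assuming Python 3's html.escape for the first pass. The names REVERT_TAGS and escape_then_revert are hypothetical, and note that html.escape also escapes single quotes by default, which goes slightly beyond the four characters in the question.

```python
import html
import re

# Hypothetical whitelist of tag names to restore after escaping.
REVERT_TAGS = ['br']

def escape_then_revert(text, tags=REVERT_TAGS):
    """Escape everything, then restore the whitelisted tags."""
    # Pass 1: escape &, < and > (and quotes, which html.escape handles by default).
    escaped = html.escape(text)
    # Pass 2: turn &lt;tag&gt;, &lt;/tag&gt; and &lt;tag/&gt; back into real tags.
    names = '|'.join(re.escape(t) for t in tags)
    pattern = re.compile(r'&lt;(/?(?:%s)/?)&gt;' % names)
    return pattern.sub(r'<\1>', escaped)

print(escape_then_revert('<p>"a & b"</p><br/>'))
# prints: &lt;p&gt;&quot;a &amp; b&quot;&lt;/p&gt;<br/>
```

Text that already contained a literal &lt;br/&gt; is not reverted by mistake, because its & becomes &amp; in the first pass.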

9 Comments

Is there something better and more elegant? In any case, I prefer to use regex.
@stdio: But this answer does use regex. Once you've converted everything, just undo the tag you didn't really want to change.
@tchrist: yes, but it's not so elegant, and I prefer to use re.sub() and do everything in one step (excluding the 'br' tag from the replacement through the regex).
@stdio: Please edit your question and show what you're currently doing so I can see where to modify it. Why are you doing this anyway? Some BB posting you need to launder of all HTML or something?
@stdio: Well, if you want something more elegant... Something tells me that your HTML wasn't born like this: <html><br/><body><br/>. Go to the source of the problem and remove the premature insertion of the <br/>s, then insert them at the end of each line after escaping the tags.
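
The suggestion in the last comment could be sketched as follows, assuming the raw text is available before any <br/> has been inserted. The function name escape_lines_with_br is hypothetical, and html.escape stands in for whatever escaping is already in use.

```python
import html

def escape_lines_with_br(raw_text):
    """Escape each line on its own, then join the lines with <br/>."""
    # No regex exclusion is needed: the <br/> is added only after escaping.
    return '<br/>\n'.join(html.escape(line) for line in raw_text.splitlines())

print(escape_lines_with_br('<html>\n</html>'))
```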
0

Does this correspond to what you need?

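import re
from html.entities import entitydefs  # Python 3 name of the old htmlentitydefs module

ss = '''
<html>
    <br>
        <title>"War & Peace"</title>
        <body>Leon Tolstoy</body>
    <br/>
</html>'''

print(ss)
print('\n\n')

# Characters that are always replaced.
uniquechars_repl = '"&'
# Characters that are replaced only when they are not part of <br/>.
conditional_repl = {'<': '<(?!br/>)',
                    '>': '(?<!<br/)>'}

all_repl = list(uniquechars_repl) + list(conditional_repl.keys())

# Map each character to its entity, e.g. '<' -> '&lt;'.
di = dict((b, '&%s;' % a) for a, b in entitydefs.items()
          if b in all_repl)

# One alternation, so a single re.sub() pass handles all four characters.
pat = '|'.join(list(uniquechars_repl) + list(conditional_repl.values()))

text = re.sub(pat, lambda mat: di[mat.group()], ss)

print(text)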

result

<html>
    <br>
        <title>"War & Peace"</title>
        <body>Leon Tolstoy</body>
    <br/>
</html>




&lt;html&gt;
    &lt;br&gt;
        &lt;title&gt;&quot;War &amp; Peace&quot;&lt;/title&gt;
        &lt;body&gt;Leon Tolstoy&lt;/body&gt;
    <br/>
&lt;/html&gt;
