regex exclusion in python

Question

I'm not very good with regex and I'm looking for the syntax to exclude something. I'm parsing <, >, " and & in HTML code (to replace them with &lt;, etc.) and I need to exclude <br/> from parsing. For example:

<html><br/>
   <head><title></title></head><br/>
   <body><br/>
   </body><br/>
</html>

I tried something like r'<\b?![br]' and others, but they don't work completely. I use re.sub() to do the replacement.

@stdio you don't need external libraries; Python comes with the excellent ElementTree (an API which lxml provides an even better implementation of) out of the box.
– Charles Duffy, Commented Sep 4, 2011 at 19:11
XML (like SGML, which it extends) is not a regular language (in the computer science meaning of the term -- if you've taken a compiler design class, they should go into it). Regular expressions are not powerful enough to parse it.
– Charles Duffy, Commented Sep 4, 2011 at 19:13
@Charles Most modern regular expression implementations (including Python's) aren't truly regular. Also, closing this question as a duplicate of that joke post helps the OP in no way.
– NullUserException, Commented Sep 4, 2011 at 19:17
This was erroneously closed as an exact duplicate OF A JOKE ANSWER!!! How much more stupid and lame, and wrong, can you possibly get? Voting to reopen. The guy deserves to have his question answered. This BURN THE WITCH attitude around here is absolutely too damned much!
– tchrist, Commented Sep 4, 2011 at 19:24
Unless I'm missing something, and as long as it's just <br/> (not any variants), then you can just replace <(?!br/>) with &lt; and (?<!<br/)> with &gt; and that's it?
– Peter Boughton, Commented Sep 4, 2011 at 20:04

Peter Boughton · Accepted Answer · 2011-09-04 20:25:43Z

3

Ok, now the question is open again, I can do it as an answer, so...

Unless I'm missing something, and as long as it's just <br/> (not any variants), then you can just replace <(?!br/>) with &lt; and (?<!<br/)> with &gt; and that's it?

In Python, it looks like that means this:

text = re.sub( '<(?!br/>)' , '&lt;' , text )
text = re.sub( '(?<!<br/)>' , '&gt;' , text )

To explain what's going on, (?!...) is a negative lookahead - it only successfully matches at a position if the following text does not match the sub-expression it contains.
(Note lookaheads do not consume the text matched by their sub-expression, they only verify if it exists, or not.)

Similarly, (?<!...) is a negative lookbehind, and does the same thing but using the preceding text.
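As a rough, runnable sketch of the two substitutions above (the function name escape_keep_br and the sample string are made up for illustration, and the handling of & and " is an assumption based on the characters listed in the question):

```python
import re

def escape_keep_br(text):
    """Escape &, ", < and > but leave the literal tag <br/> untouched."""
    # Escape '&' first, so the '&' in the entities added below is not escaped again.
    text = text.replace('&', '&amp;')
    text = text.replace('"', '&quot;')
    # '<' is replaced unless it is immediately followed by 'br/>'.
    text = re.sub(r'<(?!br/>)', '&lt;', text)
    # '>' is replaced unless it is immediately preceded by '<br/'.
    text = re.sub(r'(?<!<br/)>', '&gt;', text)
    return text

sample = '<title>"War & Peace"</title><br/>'
print(escape_keep_br(sample))
```

Because neither lookaround consumes any text, each match is exactly one character (the < or the >), so only that character is replaced.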

However, lookbehinds do differ slightly from lookaheads (in some regex implementations): the sub-expressions inside lookbehinds must represent fixed-width or limited-width matches.

Python is one of the ones that requires a fixed width - so whilst the above expression works (because it's always four characters), if it was (?<!<br\s*/?)> then it would not be a valid regex for Python because it represents a variable length match. (However, you can stack multiple lookbehinds, so you could potentially manually iterate the assorted options, if that was necessary.)
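To illustrate the fixed-width restriction and the stacking workaround, here is a minimal sketch. It assumes the only spellings to protect are <br>, <br/> and <br />; that list, and the names LT, GT and escape_keep_br_variants, are chosen for this example only.

```python
import re

# The standard-library re module rejects a variable-width lookbehind.
try:
    re.compile(r'(?<!<br\s*/?)>')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Lookaheads may be variable width, so one lookahead covers all three spellings.
LT = re.compile(r'<(?!br(?: ?/)?>)')
# Lookbehinds must be fixed width, so stack one per spelling:
# '<br' (3 chars), '<br/' (4 chars), '<br /' (5 chars).
GT = re.compile(r'(?<!<br)(?<!<br/)(?<!<br /)>')

def escape_keep_br_variants(text):
    """Escape < and > except in <br>, <br/> and <br />."""
    text = LT.sub('&lt;', text)
    return GT.sub('&gt;', text)

print(variable_width_ok)
print(escape_keep_br_variants('a<br>b<br/>c<br />d<p>'))
```

Each stacked lookbehind is checked at the same position, and the > is replaced only if none of the listed prefixes precedes it.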

edited Sep 4, 2011 at 20:25

answered Sep 4, 2011 at 20:13

Peter Boughton


2 Comments

stdio Over a year ago

I already said: perfect ;) Now, is there a way to do it all in one step? For the regex it's no problem, I can use the 'or' operator (|), but is there a way to pass multiple values to re.sub() as the second parameter?

Peter Boughton Over a year ago

You're replacing with different things, so you can't really do it in one step. Well, I think PHP lets you pass in an array (for both regex and replacement), but this isn't mentioned in the Python docs, so it would need to be a user-defined function if it's that important. Of course, you can probably also do re.sub( '<(?!br/>)' , '&lt;' , re.sub( '(?<!<br/)>' , '&gt;' , text ) ) if it's just a case of not wanting a temporary variable.

Joaquim Rendeiro · Accepted Answer · 2011-09-04 18:51:36Z

0

Replace everything, then in a second pass replace "&lt;br/&gt;" with "<br/>".

Or, to generalize, have a list of tags you want to 'revert' and replace "&lt;tag&gt;" with "<tag>", "&lt;/tag&gt;" with "</tag>" and "&lt;tag/&gt;" with "<tag/>".
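A minimal sketch of this escape-then-revert idea, assuming Python 3's html.escape for the first pass (the answer does not say how the first pass is done; the function name escape_except and its keep_tags parameter are hypothetical):

```python
import html

def escape_except(text, keep_tags=("br",)):
    """Escape everything, then restore the listed tags in a second pass."""
    # First pass: escape &, <, > and quotes everywhere.
    escaped = html.escape(text)
    # Second pass: turn the escaped form of each kept tag back into the tag.
    for tag in keep_tags:
        for form in ("<%s>" % tag, "</%s>" % tag, "<%s/>" % tag):
            escaped = escaped.replace(html.escape(form), form)
    return escaped

print(escape_except('<html><br/>"a & b"</html>'))
```

One property of this approach: text that already contained a literal "&lt;br/&gt;" in the input is not reverted by mistake, because its & is itself escaped in the first pass.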

answered Sep 4, 2011 at 18:51

Joaquim Rendeiro


9 Comments

stdio Over a year ago

Something better and more elegant? however I prefer to use regex.

tchrist Over a year ago

@stdio: But this answer does use regex. Once you’ve converted everything, just undo the tag you didn’t really want to change.

stdio Over a year ago

@tchrist: yes, but it's not so elegant and I prefer to use re.sub() making all in a step (excluding 'br' tag parsing through regex).

tchrist Over a year ago

@stdio: Please edit your answer and show what you’re currently doing so I can see where to modify it. Why are you doing this anyway? Some BB posting you need to launder of all HTML or something?

Joaquim Rendeiro Over a year ago

@stdio: Well, if you want something more elegant... Something tells me that your HTML wasn't born like this: <html><br/> <body><br/>. Go to the source of the problem and remove the premature insertion of the <br/>s, then insert them at the end of each line after escaping the tags.

eyquem · Accepted Answer · 2011-09-05 23:57:54Z

Does this correspond to what you need?

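import re
from html.entities import entitydefs  # Python 3 name of the old htmlentitydefs module

ss = '''
<html>
    <br>
        <title>"War & Peace"</title>
        <body>Leon Tolstoy</body>
    <br/>
</html>'''

print(ss)
print('\n\n')

# Characters that are always replaced.
uniquechars_repl = '"&'
# Characters replaced only when they are not part of a literal <br/> tag.
conditional_repl = {'<': '<(?!br/>)',
                    '>': '(?<!<br/)>'}

all_repl = list(uniquechars_repl) + list(conditional_repl.keys())

# Map each character to its entity, e.g. '<' -> '&lt;'.
di = dict((b, '&%s;' % a) for a, b in entitydefs.items()
          if b in all_repl)

pat = '|'.join(list(uniquechars_repl) + list(conditional_repl.values()))

# Single pass: every match is one character, which is looked up in di.
text = re.sub(pat, lambda mat: di[mat.group()], ss)

print(text)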

result

<html>
    <br>
        <title>"War & Peace"</title>
        <body>Leon Tolstoy</body>
    <br/>
</html>




&lt;html&gt;
    &lt;br&gt;
        &lt;title&gt;&quot;War &amp; Peace&quot;&lt;/title&gt;
        &lt;body&gt;Leon Tolstoy&lt;/body&gt;
    <br/>
&lt;/html&gt;
