Replacing all but specific elements using regex in Python

Question

Using regular expressions in Python, I am trying to remove all XML-type elements in a string, except those containing QUOTE, eg <QUOTE>, </QUOTE> or <QUOTE A="B"> should remain, but others such as <EXAMPLE> or <TEST A="B"> should be removed. I've created this, which replaces all elements but can't work out the not part:

re.sub(r'</?[\w= \-"]+>', '', s)

Any ideas anyone?

Can you give us an example of an XML tag that you wouldn't want to remove? — rmalouf
– rmalouf, Commented Mar 24, 2011 at 18:37

Tom Zych · Accepted Answer · 2011-03-24 20:53:24Z

5

I believe a negative lookahead assertion will do what you want:

import re

regex = r'<(?!/?QUOTE\b)[^>]+>'

tests = [
    'a plain old string',
    'a string with <SOME> <XML TAGS="stuff">',
    'a string with <QUOTE>, </QUOTE>, and <QUOTE with="data">',
    'a string that has <QUOTEA> tags </QUOTEB>',
]

for i in tests:
    result = re.sub(regex, '', i)
    print('{}\n{}\n'.format(i, result))

EDIT: How it works

Lookahead assertions, as the name suggests, "look ahead" in the string being matched, but don't consume the characters they're matching. You can do positive ((?=...)) and negative ((?!...)) lookaheads. (There are also positive and negative lookbehind assertions.)

So, the regex shown matches < for the beginning of a tag, then does a negative lookahead for QUOTE with an optional / before it (/?) and a word boundary behind it (\b). If that's matched, the regex does not match, and that tag is ignored. If it's not matched, the regex goes on to eat one or more non-> characters, and the closing >. I guess you might want to have it eat any whitespace following the tag, too - I didn't do that.

edited Mar 24, 2011 at 20:53

answered Mar 24, 2011 at 19:40

Tom Zych

13.7k9 gold badges38 silver badges55 bronze badges

Sign up to request clarification or add additional context in comments.

4 Comments

eyquem Over a year ago

This solution is the good one. Not hard to find, but anyway +1

eyquem Over a year ago

However, /? must be out of the lookahead assertion

Tom Zych Over a year ago

@eyquem: It does? Why? The code appears to work correctly: both QUOTE and /QUOTE are ignored.

eyquem Over a year ago

Oh you're right: a character / is matched by [^>] . I was wandering absent-mindedly among posts

rmalouf · Accepted Answer · 2011-03-24 19:01:16Z

1

I'd first replace QUOTE with some weird symbol that doesn't appear in the text, like maybe ^:

s = re.sub(r'(</?)QUOTE','\1^',s)

Then get rid of the XML tags that don't contain your weird symbol:

s = re.sub(r'</?[\w= \-"]+>','',s)

Then put the QUOTEs back in:

s = re.sub(r'(</?)\^','\1QUOTE',s)

EDIT: You can always combine these into one line by composition:

s = re.sub(r'(</?)\^','\1QUOTE',re.sub(r'</?[\w= \-"]+>','',re.sub(r'(</?)QUOTE','\1^',s)))

edited Mar 24, 2011 at 19:01

answered Mar 24, 2011 at 18:42

rmalouf

3,4431 gold badge17 silver badges10 bronze badges

2 Comments

Dean Barnes Over a year ago

That's a nice solution, but is it possible in one regex?

rmalouf Over a year ago

I don't think so... Regexps don't actually have a negation operator.

dythim · Accepted Answer · 2011-03-24 18:56:27Z

0

rmalouf's approach should work.

Here is a potential one-liner.

re.sub(r'<[/]?[^Q][^U][^O][^T][^E][^>]*>', '', s)

[/]? should match the /, when it is present.

[^>]*> matches everything else inside the tag, and the tag closer.

If you are expecting no other tags that start with Q, you can shorten it further:

re.sub(r'<[/]?[^Q][^>]*>', '', s)

edited Mar 24, 2011 at 18:56

answered Mar 24, 2011 at 18:40

dythim

9307 silver badges24 bronze badges

6 Comments

Dean Barnes Over a year ago

That matches the element I DO want to keep though :) Can this be modified to remove all elements BUT those with QUOTE in them?

dythim Over a year ago

Yes sorry, I've edited the post! I hate it when I read the exact OPPOSITE of what something says. :)

rmalouf Over a year ago

Your first regexp will only match tags with at least 5 characters.

Dean Barnes Over a year ago

I think you're right :( For some reason it's also removing </QUOTE> which I can't fathom.

rmalouf Over a year ago

[/]? is matching the empty string between < and /, [^Q] is matching the /, [^U] is matching the Q, and so on.

|

Collectives™ on Stack Overflow

Replacing all but specific elements using regex in Python

3 Answers 3

4 Comments

2 Comments

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

4 Comments

2 Comments

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related