4

Using regular expressions in Python, I am trying to remove all XML-style tags from a string except those containing QUOTE. For example, <QUOTE>, </QUOTE> and <QUOTE A="B"> should remain, but others such as <EXAMPLE> or <TEST A="B"> should be removed. I've written the following, which removes every tag, but I can't work out how to express the "not QUOTE" part:

re.sub(r'</?[\w= \-"]+>', '', s)

Any ideas anyone?

2
  • Can you give us an example of an XML tag that you wouldn't want to remove? Commented Mar 24, 2011 at 18:37
  • Done. Forgot the backticks :( Commented Mar 24, 2011 at 18:38

3 Answers

5

I believe a negative lookahead assertion will do what you want:

import re

regex = r'<(?!/?QUOTE\b)[^>]+>'

tests = [
    'a plain old string',
    'a string with <SOME> <XML TAGS="stuff">',
    'a string with <QUOTE>, </QUOTE>, and <QUOTE with="data">',
    'a string that has <QUOTEA> tags </QUOTEB>',
]

for i in tests:
    result = re.sub(regex, '', i)
    print('{}\n{}\n'.format(i, result))

EDIT: How it works

Lookahead assertions, as the name suggests, "look ahead" in the string being matched, but don't consume the characters they're matching. You can do positive ((?=...)) and negative ((?!...)) lookaheads. (There are also positive and negative lookbehind assertions.)
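As a small illustration of the "don't consume" point, here is a minimal sketch. The sample strings and variable names are made up for the example and are not part of the question.

```python
import re

# Positive lookahead: match "<" only when "QUOTE" follows.
# The lookahead text is checked but not included in the match.
pos = re.search(r'<(?=QUOTE)', '<QUOTE>')
print(pos.group(), pos.end())   # prints: < 1

# Negative lookahead: match "<" plus a name only when "QUOTE" does not follow.
neg = re.findall(r'<(?!QUOTE)\w+', '<QUOTE> <TEST>')
print(neg)                      # prints: ['<TEST']
```

In the first case the match ends at index 1, right after the <, even though the pattern also had to see QUOTE to succeed.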

So, the regex shown matches < for the beginning of a tag, then does a negative lookahead for QUOTE with an optional / before it (/?) and a word boundary after it (\b). If the lookahead matches, the regex fails at that position and the tag is left alone. If it doesn't match, the regex goes on to consume one or more non-> characters and the closing >. You might also want it to consume any whitespace following the tag; I didn't do that.
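The whitespace idea mentioned above could be sketched like this. The TAG_WS variant (with a trailing \s*) is an assumption about what "eat any whitespace" would mean, not something taken from the question; the names TAG, TAG_WS and sample are only for illustration.

```python
import re

# Pattern from this answer.
TAG = re.compile(r'<(?!/?QUOTE\b)[^>]+>')
# Assumed variant: also consume whitespace directly after a removed tag.
TAG_WS = re.compile(r'<(?!/?QUOTE\b)[^>]+>\s*')

sample = 'a string that has <QUOTEA> tags </QUOTEB>'
plain = TAG.sub('', sample)      # leaves a double space where <QUOTEA> was
trimmed = TAG_WS.sub('', sample)
print(repr(plain))
print(repr(trimmed))

# QUOTE tags (opening, closing, with attributes) survive; others are removed.
kept = TAG.sub('', 'keep <QUOTE A="B">this</QUOTE> drop <TEST A="B">that</TEST>')
print(kept)
```

Note that <QUOTEA> is removed because \b requires the tag name to end right after QUOTE.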


4 Comments

This solution is the good one. Not hard to find, but anyway +1
However, /? must be outside the lookahead assertion
@eyquem: It does? Why? The code appears to work correctly: both QUOTE and /QUOTE are ignored.
Oh you're right: a character / is matched by [^>] . I was wandering absent-mindedly among posts
1

I'd first replace QUOTE with some weird symbol that doesn't appear in the text, like maybe ^:

s = re.sub(r'(</?)QUOTE', r'\1^', s)

Then get rid of the XML tags that don't contain your weird symbol:

s = re.sub(r'</?[\w= \-"]+>', '', s)

Then put the QUOTEs back in:

s = re.sub(r'(</?)\^', r'\1QUOTE', s)

Note that the replacement strings need to be raw strings too, otherwise '\1' is read as a control character rather than a reference to group 1.

EDIT: You can always combine these into one line by composition:

s = re.sub(r'(</?)\^', r'\1QUOTE', re.sub(r'</?[\w= \-"]+>', '', re.sub(r'(</?)QUOTE', r'\1^', s)))
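For reference, the three steps can be wrapped up as below. This is only a sketch: the function name strip_except_quote and the check for a pre-existing ^ are my own additions, and the replacement strings are written as raw strings so that \1 is treated as a group reference.

```python
import re

def strip_except_quote(s):
    """Remove simple XML-style tags from s, keeping tags that start with QUOTE."""
    # Assumption: '^' is the placeholder, so it must not already be in the text.
    if '^' in s:
        raise ValueError('placeholder "^" already occurs in the input')
    # 1. Hide QUOTE behind the placeholder, keeping "<" or "</".
    s = re.sub(r'(</?)QUOTE', r'\1^', s)
    # 2. Remove the remaining tags; '^' is not in the character class,
    #    so the marked tags are skipped.
    s = re.sub(r'</?[\w= \-"]+>', '', s)
    # 3. Put QUOTE back.
    return re.sub(r'(</?)\^', r'\1QUOTE', s)

result = strip_except_quote('<QUOTE A="B">x</QUOTE><TEST A="B">y</TEST>')
print(result)   # prints: <QUOTE A="B">x</QUOTE>y
```

One difference worth knowing: a tag such as <QUOTEA> also starts with QUOTE, so this approach keeps it.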

2 Comments

That's a nice solution, but is it possible in one regex?
I don't think so... Regexps don't actually have a negation operator.
0

rmalouf's approach should work.

Here is a potential one-liner.

re.sub(r'<[/]?[^Q][^U][^O][^T][^E][^>]*>', '', s)

[/]? should match the /, when it is present.

[^>]*> matches everything else inside the tag, and the tag closer.

If you are expecting no other tags that start with Q, you can shorten it further:

re.sub(r'<[/]?[^Q][^>]*>', '', s)

6 Comments

That matches the element I DO want to keep though :) Can this be modified to remove all elements BUT those with QUOTE in them?
Yes sorry, I've edited the post! I hate it when I read the exact OPPOSITE of what something says. :)
Your first regexp will only match tags with at least 5 characters.
I think you're right :( For some reason it's also removing </QUOTE> which I can't fathom.
[/]? is matching the empty string between < and /, [^Q] is matching the /, [^U] is matching the Q, and so on.
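The behaviour described in the last few comments can be checked with a short script. The variable names and test strings here are just for illustration.

```python
import re

five = r'<[/]?[^Q][^U][^O][^T][^E][^>]*>'   # first pattern from this answer
short = r'<[/]?[^Q][^>]*>'                  # shortened pattern from this answer

# The opening QUOTE tag is kept, as intended.
open_kept = re.sub(five, '', '<QUOTE>')
# The closing tag is removed: [/]? matches nothing and [^Q] matches the "/".
close_five = re.sub(five, '', 'x</QUOTE>y')
close_short = re.sub(short, '', 'x</QUOTE>y')
# A tag with fewer than five characters inside is not matched at all.
short_tag = re.sub(five, '', '<B>')

print(open_kept, close_five, close_short, short_tag)
```

Both patterns leave <QUOTE> alone but strip </QUOTE>, which is why the lookahead version is the safer choice.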