1

Hi, I'm working on a bug in a CMS and was hoping someone could help me with this messy regex. I need to remove everything inside the {{page? }} tags (where 'page' is a dynamic word), including any nested {{tags}} within them. The one exception is {{links? }} tags, which should be left alone.

In the code below, the regex should remove everything inside the {{homepage? }} tag:

<div id="main">   
    <div id="left">
    {{menu1}}<br />

{{homepage?
    <img src="images/{{timenow}}.gif" width="177" height="217" alt="{{imgname}}" id="biglogo" />
}}

{{links?
    <b>LINKS</b>
}}
</div>
{{menu2}}
</div>

Here's what I have so far. It stops matching as soon as it reaches the }} after timenow, so only part of the {{homepage? }} tag is removed:

$result=preg_replace("#\{\{(?!links)\S*?\?.*?}}#s","",$result);
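To make the failure concrete, here is a minimal reproduction on a shortened, made-up input (the one-line template is an assumption for illustration, not the real CMS page). The lazy `.*?}}` ends at the first `}}` it finds, which belongs to the nested {{timenow}} tag rather than to the enclosing {{homepage? }} tag.

```php
<?php
// Shortened, hypothetical template modelled on the sample above.
$result = '{{homepage? <img src="{{timenow}}.gif" /> }} {{links? <b>LINKS</b> }}';

// Same pattern as in the question.
$result = preg_replace('#\{\{(?!links)\S*?\?.*?}}#s', '', $result);

// Only the text up to the first "}}" is removed, so the tail of the
// homepage tag is left behind in the output.
echo $result, PHP_EOL;
```

The leftover starts with `.gif" /> }}`, which is the half-removed tag described above.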

Clarification:

There are no {{page? }} sub tags (all subtags are {{thisformat}} ). In other words something like: {{foo? {{links? bar }} baz }} would never occur.

3
  • What {{page? }} tags? You mean {{homepage? ... }}? Do you actually want to remove all tags except the links tag? What would happen with {{foo? {{links? bar }} baz }}? Or do you just want to grab the content of the links tag(s)? Commented May 18, 2011 at 14:48
  • That might be simple enough with a recursive regex using the (?R) syntax. In your case you might get away with: "#\{\{(?!links)\w+\?((?R)|.)*}}#s" - but the . should be rewritten to something more specific. Commented May 18, 2011 at 14:56
  • Sorry @Qtax, by {{page? }} I meant that the word page is dynamic (it can be any single word like homepage, links, contact, etc). There are no {{page? }} sub tags (all subtags are {{thisformat}} ), so your example would never occur. @mario - looks promising. I'll give it a blast and report back. Commented May 18, 2011 at 15:01

3 Answers

2

You can do something like:

#\{\{ (?!links\b) \w+ \? (?: \{\{\w+}} | [^{}]+ | \{(?!\{) | }(?!}) )* }}#sx
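A minimal sketch of how that pattern might be plugged into preg_replace. The one-line template is a made-up fragment based on the question's sample, and the variable names are only for illustration.

```php
<?php
// Pattern from this answer, unchanged. The x modifier lets the spaces be
// used for readability; s lets the content span several lines.
$pattern = '#\{\{ (?!links\b) \w+ \? (?: \{\{\w+}} | [^{}]+ | \{(?!\{) | }(?!}) )* }}#sx';

// Hypothetical template fragment modelled on the question's sample.
$template = '{{menu1}} {{homepage? <img src="{{timenow}}.gif" alt="{{imgname}}" /> }} {{links? <b>LINKS</b> }} {{menu2}}';

// Remove every {{word? ... }} block except {{links? ... }}.
$result = preg_replace($pattern, '', $template);

echo $result, PHP_EOL;
```

On this input the {{homepage? }} block disappears, including the nested {{timenow}} and {{imgname}} tags, while {{menu1}}, {{links? }} and {{menu2}} are left as they were.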

2

If I understand it correctly, there's no need for recursive matching here; the {{page? }} tags may contain simple tags like {{this}}, and that's it. In that case, you just have to watch out for the beginning of a nested tag, so you can match the end of that tag when it shows up, then go on looking for either the end of the enclosing {{page? }} tag or the beginning of another nested tag.

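// ~ is the delimiter because # starts a comment under the x modifier
$regex='~
  \{\{ (?!links\?) \w++\?     # page-tag start
  (?:
    (?: (?!\{\{|\}\}) . )++   # normal content
  |
    \{\{                      # embedded tag start
    (?: (?!\}\}) . )*+        # embedded tag content
    \}\}                      # embedded tag end
  )*+
  \}\}                        # page-tag end
~sx';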

The "normal content" part matches one or more of any character, unless the next character is the beginning of a {{ or }} sequence. Once we've started to match an embedded tag, we use the same technique to gobble up its content.

see it in action at ideone.com
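For anyone who wants to try it locally, here is a rough sketch that wraps the pattern in a helper. The function name strip_page_tags and the sample template are my own assumptions for illustration. It uses ~ as the delimiter, since a # delimiter would collide with the # comments allowed by the x modifier.

```php
<?php
// Hypothetical helper name; the pattern body follows the answer above.
function strip_page_tags(string $template): string
{
    $regex = '~
      \{\{ (?!links\?) \w++ \?        # page-tag start, but never links
      (?:
        (?: (?!\{\{|\}\}) . )++       # normal content
      |
        \{\{ (?: (?!\}\}) . )*+ \}\}  # one embedded tag
      )*+
      \}\}                            # page-tag end
    ~sx';

    // preg_replace returns null on a PCRE error; fall back to the input.
    return preg_replace($regex, '', $template) ?? $template;
}

// Sample modelled on the template in the question.
$template = <<<'TPL'
{{menu1}}
{{homepage?
  <img src="images/{{timenow}}.gif" alt="{{imgname}}" />
}}
{{links?
  <b>LINKS</b>
}}
{{menu2}}
TPL;

echo strip_page_tags($template), PHP_EOL;
```

The {{homepage? }} block is removed as a whole, and {{links? }} plus the plain {{menu1}} and {{menu2}} tags stay in place.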

-2

This is not possible with regex. Read about the millions of failed attempts to parse nested html/xml with regex.

4 Comments

  • He's not parsing *ML tho, and matching recursive structures is easy, but it's probably better to write a parser. :)
  • The problem with parsing XML with regex is the nesting.
  • Don't let the title confuse you - there's actually only one level of nesting in this question, making the language regular.
  • Knowing that would have helped :)
