PHP Regex ignoring nested tags

Question

Hi, I'm working on a bug in a CMS and I was hoping someone could give me some help with this messy regex! I need to remove everything inside the {{page? }} tags (where 'page' is a dynamic word), including any nested {{tags}} within them. The one exception is {{links? }}, which must be left alone.

In the code below, the regex should remove everything inside the {{homepage? }} tag:

<div id="main">   
    <div id="left">
    {{menu1}}<br />

{{homepage?
    <img src="images/{{timenow}}.gif" width="177" height="217" alt="{{imgname}}" id="biglogo" />
}}

{{links?
    <b>LINKS</b>
}}
</div>
{{menu2}}
</div>

Here's what I have so far. It stops matching as soon as it reaches the }} that closes {{timenow}}:

$result=preg_replace("#\{\{(?!links)\S*?\?.*?}}#s","",$result);
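To make the failure concrete, here is a minimal sketch that runs the pattern above against a cut-down version of the template. The `$template` string and the variable names are made up for illustration. The lazy `.*?}}` ends the match at the first `}}` it finds, which is the one closing `{{timenow}}`, so only the first part of the `{{homepage? }}` tag is removed:

```php
<?php
// Cut-down template, assumed for illustration only.
$template = <<<'EOT'
{{menu1}}
{{homepage?
  <img src="{{timenow}}.gif" alt="{{imgname}}" />
}}
{{links?
  <b>LINKS</b>
}}
{{menu2}}
EOT;

// The pattern from the question, unchanged.
$attempt = preg_replace("#\{\{(?!links)\S*?\?.*?}}#s", "", $template);

// The match ends at the first "}}" (the end of {{timenow}}),
// so the tail of the homepage tag is left behind in the output.
echo $attempt, "\n";
```

The leftover fragment starting at `.gif"` and the stray closing `}}` are what the answers below set out to avoid.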

Clarification:

There are no {{page? }} sub tags (all subtags are {{thisformat}} ). In other words something like: {{foo? {{links? bar }} baz }} would never occur.

What {{page? }} tags? You mean {{homepage? ... }}? Do you actually want to remove all tags except the links tag? What would happen with {{foo? {{links? bar }} baz }}? Or do you just want to grab the content of the links tag(s)?
– Qtax, Commented May 18, 2011 at 14:48
That might be simple enough with a recursive regex using the (?R) syntax. In your case you might get away with: "#\{\{(?!links)\w+\?((?R)|.)*}}#s" - but the . should be rewritten to something more specific.
– mario, Commented May 18, 2011 at 14:56
Sorry @Qtax, by {{page? }} I meant that the word page is dynamic (it can be any single word, like homepage, links, contact, etc.). There are no {{page? }} sub tags (all subtags are {{thisformat}} ), so your example would never occur. @mario - looks promising. I'll give it a blast and report back.
– cronoklee, Commented May 18, 2011 at 15:01

Qtax · Accepted Answer · 2011-05-18 15:18:20Z

2

You can do something like: #\{\{ (?!links\b) \w+ \? (?: \{\{\w+}} | [^{}]+ | \{(?!\{) | }(?!}) )* }}#sx
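As a rough sketch of how this pattern could be dropped into the `preg_replace` call from the question: the `$template` string and the variable names below are illustrative assumptions, not part of the CMS.

```php
<?php
// Cut-down template, assumed for illustration only.
$template = <<<'EOT'
{{menu1}}
{{homepage?
  <img src="{{timenow}}.gif" alt="{{imgname}}" />
}}
{{links?
  <b>LINKS</b>
}}
{{menu2}}
EOT;

// x flag: whitespace inside the pattern is ignored.
// s flag: kept from the answer; the pattern has no ".", so it has no effect here.
$pattern = '#\{\{ (?!links\b) \w+ \? (?: \{\{\w+}} | [^{}]+ | \{(?!\{) | }(?!}) )* }}#sx';

$cleaned = preg_replace($pattern, '', $template);

// The whole {{homepage? ... }} block is removed, nested tags included;
// {{links? ... }} and the simple tags outside it are left in place.
echo $cleaned, "\n";
```

The `\b` after `links` means only a tag named exactly `links` is protected, whereas the `(?!links)` in the question would also skip a tag such as `{{linkspage? }}`.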


Alan Moore · Accepted Answer · 2011-05-18 16:11:33Z

If I understand it correctly, there's no need for recursive matching here; the {{page? }} tags may contain simple tags like {{this}}, and that's it. In that case, you just have to watch out for the beginning of a nested tag, so you can match the end of that tag when it shows up, then go on looking for either the end of the enclosing {{page? }} tag or the beginning of another nested tag.

// "~" delimiters, so the "#" comments inside the pattern do not end it early
$regex='~
  \{\{ (?!links\?) \w++\?     # page-tag start
  (?:
    (?: (?!\{\{|\}\}) . )++   # normal content
  |
    \{\{                      #
    (?: (?!\}\}) . )*+        # embedded tag
    \}\}                      #
  )*+
  \}\}                        # page-tag end
~sx';

The "normal content" part matches one or more of any character, unless the next character is the beginning of a {{ or }} sequence. Once we've started to match an embedded tag, we use the same technique to gobble up its content.

see it in action at ideone.com
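A minimal usage sketch, assuming a cut-down template (the `$template` string and the variable names are illustrative). Note that the delimiter must not be `#` when the pattern contains `#` comments, since PHP ends the pattern at the first unescaped delimiter; the sketch uses `~`.

```php
<?php
// Cut-down template, assumed for illustration only.
$template = <<<'EOT'
{{menu1}}
{{homepage?
  <img src="{{timenow}}.gif" alt="{{imgname}}" />
}}
{{links?
  <b>LINKS</b>
}}
{{menu2}}
EOT;

$pageTagRegex = '~
  \{\{ (?!links\?) \w++\?     # page-tag start
  (?:
    (?: (?!\{\{|\}\}) . )++   # normal content
  |
    \{\{                      #
    (?: (?!\}\}) . )*+        # embedded tag
    \}\}                      #
  )*+
  \}\}                        # page-tag end
~sx';

// Remove every page tag except {{links? ... }}.
$stripped = preg_replace($pageTagRegex, '', $template);

echo $stripped, "\n";
```

The possessive quantifiers (`++`, `*+`) stop the engine from backtracking into content it has already consumed, which keeps a failed match attempt cheap.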

cweiske · Accepted Answer · 2011-05-18 14:50:24Z

-2

This is not possible with regex. Read about the millions of failed attempts to parse nested html/xml with regex.


4 Comments

Qtax Over a year ago

he's not parsing *ML tho, and matching recursive structures is easy, but it's probably better to write a parser. :)

cweiske Over a year ago

The problem about parsing XML with regex is the nesting.

Kobi Over a year ago

Don't let the title confuse you - there's actually only one level of nesting in this question, making the language regular.

cweiske Over a year ago

Knowing that would have helped :)
