2

I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.

It seems like there's no way of 'finding' comments using the PHP DOM.

I'm already using regex for some of the processing; however, I am very unfamiliar with (have yet to grasp or truly understand) lookaheads and lookbehinds in regex.

For instance, I may have the following code:

<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>

I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.

How would this usually be done in RegEx?

Here's roughly what I've understood about negative lookarounds, along with my attempt at one. I'm clearly doing something very wrong!

(?<!<pre>.*?)<!--.*-->(?!.*?</pre>)

12
  • You could use the PHP Simple HTML DOM Parser instead. Commented Aug 16, 2013 at 9:07
  • Or one of the other countless alternatives. Commented Aug 16, 2013 at 9:12
  • Neither of these "links" answers the question. I will add some things I have attempted to make the question more specific. I would rather not use an external library if possible. Commented Aug 16, 2013 at 9:27
  • Do you control the input HTML directly, so that you can ensure that there is no JavaScript or comment containing <pre>, no CDATA blocks, and no nested comments or <pre> blocks? If you cannot ensure this, there is probably no sensible solution using regex. If you can, I'll try to give one =) Commented Aug 16, 2013 at 9:33
  • 1
    @Joel the problem with regex is, PCRE does not support lookbehinds of variable length. So while your attempt is actually pretty sound (except for some greediness problems), it would only work in .NET. This is why it's near impossible to solve this robustly with regex. Commented Aug 16, 2013 at 9:37

4 Answers

2

You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.

Having said that, here's what you could (but should not, see above) do:

First, identify comments, e.g. using

<!-- (?:(?!-->).)*-->

The negative look-ahead ensures that the repeated . stops at the first --> and does not run past the end of the comment.

Now you need to figure out whether this comment is inside a <pre> block. The key observation is that a comment that is NOT inside a <pre> block is followed by an even total number of <pre> and </pre> tags (assuming the blocks are balanced and not nested), whereas a comment inside one is followed by an odd number.

So, run through the rest of your text, always consuming <pre> and </pre> tags in pairs, and check whether you arrive at the end.

This would look like

(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

So, together this would be

<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

A hurray for write-only code =)

The main building block of this expression is (?:(?!</?pre>).) which matches any single character that does not begin a <pre> or </pre> tag.

Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader. See this in action at RegExr.
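As a rough illustration, the combined expression can be dropped into PHP's preg_replace() to do the actual wrapping. This is a minimal sketch rather than a robust solution: the ~ delimiter, the s modifier (so that . also matches newlines), and the variable names $html, $notPre, $pattern and $result are choices made for this example, and it assumes balanced, non-nested <pre> blocks without attributes. The literal space after <!-- is kept from the expression above, so a comment written as <!--x--> would not be matched.

```php
<?php
// Sample input from the question.
$html = <<<'HTML'
<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>
HTML;

// Building block: any run of characters that does not start a <pre> or </pre> tag.
$notPre = '(?:(?!</?pre>).)*';

// A comment, followed only by PAIRS of <pre>/</pre> tags up to the end of the text.
$pattern = '~<!-- (?:(?!-->).)*-->(?=' . $notPre
         . '(?:</?pre>' . $notPre . '</?pre>' . $notPre . ')*$)~s';

// $0 is the whole matched comment; wrap it in a new <pre> element.
$result = preg_replace($pattern, '<pre>$0</pre>', $html);

echo $result;
```

On the sample input only Comment 1 is wrapped; Comment 2 is followed by a single </pre>, so the look-ahead fails and it is left alone. preg_replace() returns null if PCRE gives up (for example, when the backtrack limit is hit on a large document), so checking for null before using the result is probably wise.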


1 Comment

Despite the fact you suggest not actually using this method, I will mark this as the answer as it succinctly answers my question re: regex, and is useful. Thank you! :)
1

It seems like there's no way of 'finding' comments using the PHP DOM.

Of course you can... Check this code using PHP Simple HTML DOM Parser:

<?php
// Assumes simple_html_dom.php (PHP Simple HTML DOM Parser) is on the include path.
require_once 'simple_html_dom.php';

$text = '<!-- Comment 1 -->

        <pre>
            <div class="some_html"></div>
            <!-- Comment 2 -->
        </pre>';

echo "<div>Original Text: <xmp>$text</xmp></div>";

$html = str_get_html($text);

// find('comment') returns every comment node in the document
$comments = $html->find('comment');

if ($comments) {
    echo '<br>Find function found ' . count($comments) . ' results: ';

    foreach ($comments as $key => $com) {
        echo '<br>' . $key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
    }
} else {
    echo 'find() failed!';
}
?>

$com->innertext will give you the comments like <!-- Comment 1 -->...

Now you just have to clean them up as you wish, for example using <!--\s*(.*)\s*-->... Try it HERE
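A small sketch of that clean-up step with preg_match(). Note that this is a slight variation on the pattern above, not the exact expression: with the greedy (.*), the whitespace before --> ends up inside the captured group, so the sketch uses a lazy (.*?) and adds the s modifier for multi-line comments. The variable names $comment and $inner are made up for the example.

```php
<?php
$comment = '<!-- Comment 1 -->';

// Lazy (.*?) lets the trailing \s* absorb the whitespace before -->;
// the s modifier lets . match newlines in multi-line comments.
$inner = preg_match('~<!--\s*(.*?)\s*-->~s', $comment, $m) ? $m[1] : null;

echo $inner;
```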

Edit:

Just a note concerning the lookbehind: it MUST have a fixed width, so you cannot use repetition (* or +) or optional items (?) inside it.

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Source: http://www.regular-expressions.info/lookaround.html
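To see this restriction in PHP, here is a minimal sketch comparing the lookbehind from the question with a fixed-width one. It assumes PHP's preg_* functions (PCRE); the variable names $variable and $fixed are made up for the example. A failed compilation makes preg_match() return false and raise a warning, which is suppressed here with @.

```php
<?php
// The lookbehind from the question contains .*?, which has no fixed length,
// so PCRE refuses to compile the pattern and preg_match() returns false.
$variable = @preg_match('~(?<!<pre>.*?)<!--.*?-->~s', '<!-- Comment 1 -->');

// A lookbehind over a literal string has a known length and compiles fine.
$fixed = preg_match('~(?<!<pre>)<!--.*?-->~s', '<!-- Comment 1 -->');

var_dump($variable);
var_dump($fixed);
```

The fixed-width version compiles, but it only rejects a comment that comes immediately after <pre>, so it does not solve the original problem on its own.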

1 Comment

Using the simple html dom external library, yes. Not using the native PHP DOM class
0

XPath is your friend:

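// Assumes $html holds the markup to process (not shown in the original answer).
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every comment node that has no <pre> ancestor
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
    $pre = $doc->createElement('pre');
    // Put the new <pre> where the comment was, then move the comment into it
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment);
}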


0

It's quite easy using a principle called the stack counter.
Essentially, you count the number of <pre> tags and the number of </pre> tags up to the point in the HTML code where your segment is placed.
If there are more <pre> than </pre> tags, the segment is inside a <pre> block: "<pre>..--you are here--..</pre>".
In that case, simply return the match unmodified. Simple as that.
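The counting idea could be sketched in PHP roughly as follows. This is only an illustration under assumptions: the function name wrapCommentsOutsidePre is hypothetical, the patterns used to recognise comments and <pre> tags are simplifications chosen for the example, and anything that looks like a <pre> tag inside a script or an earlier comment would be counted too.

```php
<?php
// Hypothetical helper: wraps every comment that is not inside a <pre> block.
function wrapCommentsOutsidePre(string $html): string
{
    // Find every comment together with its byte offset in $html.
    preg_match_all('~<!--.*?-->~s', $html, $matches, PREG_OFFSET_CAPTURE);

    $out = '';
    $pos = 0;
    foreach ($matches[0] as [$comment, $offset]) {
        $before = substr($html, 0, $offset);
        // More opening than closing tags so far means we are inside a <pre>.
        $depth = preg_match_all('~<pre\b[^>]*>~i', $before)
               - preg_match_all('~</pre\s*>~i', $before);

        // Copy the text between the previous comment and this one unchanged.
        $out .= substr($html, $pos, $offset - $pos);
        $out .= $depth > 0 ? $comment : '<pre>' . $comment . '</pre>';
        $pos = $offset + strlen($comment);
    }

    // Append whatever follows the last comment.
    return $out . substr($html, $pos);
}

echo wrapCommentsOutsidePre("<!-- Comment 1 -->\n<pre>\n<!-- Comment 2 -->\n</pre>");
```

Re-counting from the start of the text for every comment keeps the sketch short but is quadratic in the worst case; a single pass that tracks the depth while scanning would scale better.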
