PHP get html comments in string and wrap in <pre> tag. Regex or DOM?

Question

I would like to find comment tags in a string that are NOT already inside a <pre> tag, and wrap them in a <pre> tag.

It seems like there's no way of 'finding' comments using the PHP DOM.

I'm using regex to do some of the processing already, however I am very unfamiliar with (have yet to grasp or truly understand) look aheads and look behinds in regex.

For instance, I may have the following code:

<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>

I would like to wrap Comment 1 in <pre> tags, but obviously not Comment 2 as it already resides in a <pre>.

How would this usually be done in RegEx?

Here's what I've understood about negative lookarounds, and my attempt at one. I'm clearly doing something very wrong!

(?<!<pre>.*?)(?!.*?</pre>)

Neither of these "links" answer the question. I will put some things I have attempted to make the question more specific. I would rather not use an external library if possible.
– Joel, Commented Aug 16, 2013 at 9:27
Do you control the input HTML directly, so that you can ensure that there is no JavaScript or Comments containing <pre>, no CDATA blocks, and no nested comments or <pre> blocks? If you cannot ensure this, there is probably no sensible solution using regex. If you can, I'll try to give one =)
– Jens, Commented Aug 16, 2013 at 9:33
@Joel the problem with regex is, PCRE does not support lookbehinds of variable length. So while your attempt is actually pretty sound (except for some greediness problems), it would only work in .NET. This is why it's near impossible to solve this robustly with regex.
– Martin Ender, Commented Aug 16, 2013 at 9:37

Jens · Accepted Answer · 2013-08-16 09:57:34Z

2

You should really use a DOM parser if you are planning on re-using this code. Every regex approach will fail horribly sooner rather than later when presented with real-world HTML.

Having said that, here's what you could (but should not, see above) do:

First, identify comments, e.g. using

<!-- (?:(?!-->).)*-->

The negative look-ahead block ensures that the .* does not run out of the comment block.
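As a rough illustration of this first step, here is a minimal PHP sketch that applies the comment pattern to the sample from the question. The "#" delimiter, the "s" modifier, and the variable names are assumptions added for the example, not part of the original answer.

```php
<?php
// Sample input taken from the question.
$html = <<<'HTML'
<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>
HTML;

// The comment pattern from the answer. "#" is used as the delimiter so that "/"
// needs no escaping; the "s" modifier lets "." also match newlines, which
// matters for multi-line comments.
$commentPattern = '#<!-- (?:(?!-->).)*-->#s';

preg_match_all($commentPattern, $html, $matches);

// $matches[0] holds every full match, in document order.
foreach ($matches[0] as $comment) {
    echo $comment, PHP_EOL;
}
```

Note that the pattern requires a space after the opening <!--, so a comment written without that space would not be matched.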

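Now you need to figure out whether this comment is inside a <pre> block. The key observation is that a comment that is NOT inside a <pre> block is followed by an even number of <pre> and </pre> tags (counted together), while a comment inside one is followed by an odd number, because its closing </pre> has no partner. This assumes the <pre> blocks are well formed and not nested.

So, run through the rest of the text, consuming <pre> and </pre> tags two at a time, and check whether you arrive at the end.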

This would look like

(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

So, together this would be

<!-- (?:(?!-->).)*-->(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)

A hurray for write-only code =)

The prominent building block of this expression is (?:(?!</?pre>).) which matches every character that is not the starting bracket of a <pre> or </pre> sequence.

Allowing attributes on the <pre> and proper escaping are left as an exercise for the reader. See this in action at RegExr.
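To see how the combined expression might be used from PHP, here is a minimal sketch built on preg_replace. The "#" delimiter, the "s" modifier, and the variable names are assumptions for the example; it inherits every limitation mentioned above (no attributes on <pre>, no nesting, well-formed input only).

```php
<?php
// Sample input taken from the question.
$html = <<<'HTML'
<!-- Comment 1 -->

<pre>
    <div class="some_html"></div>
    <!-- Comment 2 -->
</pre>
HTML;

// The combined pattern from the answer, split into its two parts for readability.
$comment   = '<!-- (?:(?!-->).)*-->';
$notInPre  = '(?=(?:(?!</?pre>).)*(?:</?pre>(?:(?!</?pre>).)*</?pre>(?:(?!</?pre>).)*)*$)';

// "s" lets "." match newlines so the lookahead can scan to the end of the text.
$pattern = '#' . $comment . $notInPre . '#s';

// $0 is the whole matched comment; only comments outside <pre> are matched.
$wrapped = preg_replace($pattern, '<pre>$0</pre>', $html);

echo $wrapped, PHP_EOL;
```

On this input only Comment 1 should end up wrapped, since exactly one tag (the closing </pre>) follows Comment 2.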

edited Aug 16, 2013 at 9:57

answered Aug 16, 2013 at 9:51

Jens


1 Comment

Joel Over a year ago

Despite the fact you suggest not actually using this method, I will mark this as the answer as it succinctly answers my question re: regex, and is useful. Thank you! :)

Community · Answer · 2020-06-20 09:12:55Z

It seems like there's no way of 'finding' comments using the PHP DOM.

Of course you can... Check this code using PHP Simple HTML DOM Parser:

<?php
// Assumes the PHP Simple HTML DOM Parser file is available on the include path.
include 'simple_html_dom.php';

$text = '<!-- Comment 1 -->

        <pre>
            <div class="some_html"></div>
            <!-- Comment 2 -->
        </pre>';

echo "<div>Original Text: <xmp>$text</xmp></div>";

$html = str_get_html($text);

// Find every comment node in the parsed document.
$comments = $html->find('comment');

if ($comments) {
  echo '<br>Find function found ' . count($comments) . ' results: ';

  foreach ($comments as $key => $com) {
    echo '<br>' . $key . ': ' . $com->tag . ' which contains = <xmp>' . $com->innertext . '</xmp>';
  }
} else {
  echo "Find() fails !";
}
?>

$com->innertext will give you the comments like ...

You now just have to clean them up as you wish. For example using ... Try it HERE

Edit:

Just a note concerning the lookbehind: it MUST have a fixed width, therefore you cannot use repetition (* or +) or optional items (?).

The bad news is that most regex flavors do not allow you to use just any regex inside a lookbehind, because they cannot apply a regular expression backwards. Therefore, the regular expression engine needs to be able to figure out how many steps to step back before checking the lookbehind.

Therefore, many regex flavors, including those used by Perl and Python, only allow fixed-length strings. You can use any regex of which the length of the match can be predetermined. This means you can use literal text and character classes. You cannot use repetition or optional items. You can use alternation, but only if all options in the alternation have the same length.

Source: http://www.regular-expressions.info/lookaround.html

Using the simple html dom external library, yes. Not using the native PHP DOM class

pguardiario · Answer · 2013-08-17 02:00:35Z

0

Xpath is your friend:

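// $doc is assumed to be a DOMDocument that already holds the parsed HTML.
$xpath = new DOMXPath($doc);

// Select every comment that has no <pre> ancestor and wrap it in a new <pre>.
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
  $pre = $doc->createElement("pre");
  $comment->parentNode->insertBefore($pre, $comment);
  $pre->appendChild($comment);
}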
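For completeness, a self-contained sketch of the same XPath approach, including the document setup the snippet leaves out. The sample markup and variable names are assumptions for illustration; the input is wrapped in a <div> so that every comment has an element parent once the native parser has built the tree.

```php
<?php
// Hypothetical sample input, wrapped in a <div> for this example.
$html = <<<'HTML'
<div>
<!-- Comment 1 -->
<pre>
    <!-- Comment 2 -->
</pre>
</div>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);

// Wrap every comment that has no <pre> ancestor in a freshly created <pre>.
foreach ($xpath->query('//comment()[not(ancestor::pre)]') as $comment) {
    $pre = $doc->createElement('pre');
    $comment->parentNode->insertBefore($pre, $comment);
    $pre->appendChild($comment); // appendChild moves the node into the new <pre>
}

// Serialize only the wrapper <div>, not the <html>/<body> added by loadHTML().
$root = $doc->getElementsByTagName('div')->item(0);
echo $doc->saveHTML($root), PHP_EOL;
```

Because the ancestor check runs on the parsed tree, this version does not depend on counting tags and also copes with attributes on <pre>.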

answered Aug 17, 2013 at 2:00

pguardiario


user257319 · Answer · 2015-01-30 19:43:52Z

0

It's quite easy using a principle called the stack counter.
Essentially, you count the number of <pre> tags and the number of </pre> tags up to the point in the HTML where your segment is placed.
If there are more <pre> than </pre>, your segment sits inside an open block: "<pre>..--you are here--..</pre>".
In that case, simply return the match unmodified.
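One way this counting idea could look in PHP is sketched below. It is a variant, not the answerer's code: instead of recounting all tags before each comment, it keeps a running counter while a single preg_replace_callback pass visits every <pre>, </pre>, and comment in order. The function name wrapCommentsOutsidePre and the token pattern are hypothetical choices for this example.

```php
<?php
/**
 * Wraps HTML comments in <pre> tags unless they already sit inside a <pre> block.
 * Hypothetical helper for illustration only.
 */
function wrapCommentsOutsidePre(string $html): string
{
    $depth = 0; // number of <pre> blocks currently open

    // One pass over every opening <pre ...>, closing </pre>, and comment.
    $result = preg_replace_callback(
        '#<pre\b[^>]*>|</pre>|<!--.*?-->#is',
        function (array $m) use (&$depth): string {
            $token = $m[0];
            if (strncmp($token, '<!--', 4) === 0) {
                // Comment: wrap it only when no <pre> is open.
                return $depth > 0 ? $token : '<pre>' . $token . '</pre>';
            }
            if ($token[1] === '/') {
                $depth = max(0, $depth - 1); // closing </pre>
            } else {
                $depth++;                    // opening <pre>
            }
            return $token; // tags pass through unchanged
        },
        $html
    );

    // preg_replace_callback returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

echo wrapCommentsOutsidePre("<!-- Comment 1 -->\n<pre>\n<!-- Comment 2 -->\n</pre>"), PHP_EOL;
```

Since a whole comment is consumed as one token, a <pre> that appears inside a comment is not counted. Script blocks and CDATA sections are still not handled.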

answered Jan 30, 2015 at 19:43

user257319
