3

I want a regex that matches the outermost delimiters and ignores any nested matches.

For example, given this input:

/*asdasdasd /* asdasdsa */ qweqweqwe */

I want to match the first "/*" with the last "*/", rather than stopping at the first "*/".

Thanks...

8
  • Also when they are inside quoted strings? That adds a completely new ingredient to the mix. Commented Mar 18, 2011 at 21:34
  • 4
    You cannot parse unlimited nesting with pure regex. Commented Mar 18, 2011 at 21:34
  • @SLaks When I try to match the */, I get the first one, but I want the last one. Commented Mar 18, 2011 at 21:35
  • 1
    "/*asdasdasd /* asdasdsa */ qweqweqwe */".replace(/\/\*.*\*\//, "t") replaces the whole string with t. Commented Mar 18, 2011 at 21:36
  • @Radek S: I don't care about quoted strings. There will be only plain text inside! Commented Mar 18, 2011 at 21:36

5 Answers

4

Regular expressions are greedy by default, so you can just use:

\/\*.*\*\/

If you wanted the behavior you're afraid of, a lazy match that stops at the first "*/", you would have to add a ? after the quantifier:

\/\*.*?\*\/
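
To make the difference concrete, here is a small sketch that runs both patterns against the string from the question. The variable names are just for illustration.

```javascript
// Sample input from the question: one comment nested inside another.
const sample = "/*asdasdasd /* asdasdsa */ qweqweqwe */";

// Greedy: .* consumes as much as possible, so the match ends at the last "*/".
const greedyMatch = sample.match(/\/\*.*\*\//)[0];

// Lazy: .*? consumes as little as possible, so the match ends at the first "*/".
const lazyMatch = sample.match(/\/\*.*?\*\//)[0];

console.log(greedyMatch); // "/*asdasdasd /* asdasdsa */ qweqweqwe */"
console.log(lazyMatch);   // "/*asdasdasd /* asdasdsa */"
```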

2 Comments

But the dot (.) only matches within a single line, if I understand correctly. For multiline input, is [\w\W]* correct, or is there a better solution?
Does not work if there are multiple comments with stuff you want to keep between them.
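
Both comments can be checked with a short sketch. It assumes [\s\S] as the "any character" class (it behaves the same as [\w\W] here), and the sample strings are made up for illustration.

```javascript
// A comment that spans two lines.
const multiline = "/* first line\n   second line */";

const dotPattern = /\/\*.*\*\//;          // "." does not match line breaks
const anyCharPattern = /\/\*[\s\S]*\*\//; // [\s\S] matches any character, line breaks included

const dotMatches = dotPattern.test(multiline);         // false
const anyCharMatches = anyCharPattern.test(multiline); // true

// The greedy pattern still swallows the text between two separate comments.
const twoComments = "keep /* a */ keep /* b */ keep";
const overStripped = twoComments.replace(anyCharPattern, "");

console.log(dotMatches, anyCharMatches);
console.log(overStripped); // "keep  keep" (the middle "keep" is lost)
```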
3

Regular expressions can't count nested items by definition (though implementations do go further than the computer-science definition).

See http://en.wikipedia.org/wiki/Regular_expression#Expressive_power_and_compactness
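
Since the counting cannot happen inside a JavaScript regex, one option is to do it in ordinary code instead. The following is only a sketch: stripWithDepthCounter is a hypothetical helper name, and it assumes that a stray "*/" outside any comment should be kept as text and that an unterminated comment swallows the rest of the input.

```javascript
// Hypothetical helper: removes properly nested /* ... */ comments
// by tracking the nesting depth while scanning the text once.
function stripWithDepthCounter(text) {
    let depth = 0; // current nesting level; 0 means "outside any comment"
    let out = "";
    let i = 0;
    while (i < text.length) {
        const pair = text.slice(i, i + 2);
        if (pair === "/*") {
            depth++;        // entering a (possibly nested) comment
            i += 2;
        } else if (pair === "*/" && depth > 0) {
            depth--;        // leaving the current comment level
            i += 2;
        } else {
            if (depth === 0) {
                out += text[i]; // keep only characters outside comments
            }
            i++;
        }
    }
    return out;
}

const counted = stripWithDepthCounter("keep /* a /* b */ c */ keep");
console.log(counted); // "keep  keep"
```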

2 Comments

Yes, it's true that REGULAR expressions cannot match nested structures, but Perl, PHP and .NET regexes certainly can.
I basically knew that regexes are more powerful than "REGULAR expressions", but I certainly didn't know that some implementations handle nesting. That's interesting :) Also, it seems to me that ridgerunner's answer is the most correct one.
1

The solutions presented so far work OK if the text contains only one nested comment. However, as LHMathies noted, if the text has more than one comment with stuff you want to keep between them, these solutions fail. For example, here is some test data to verify that the algorithm works correctly:

/* one */
Stuff one
/* two /* three */ two */
Stuff two
/* four */

A correct solution will preserve the two lines with stuff in them. To handle this case correctly in JavaScript, you need a regex that matches an innermost comment (this is the hard part), and then you apply it repeatedly until all the comments are gone. Here is a tested function which does precisely that:

function strip_nested_C_comments(text)
{ // Regex to match innermost "C" style comment.
    var re = /\/\*[^*\/]*(?:(?!\/\*|\*\/)[*\/][^*\/]*)*\*\//i;
    // Iterate stripping comments from inside out.
    while (text.search(re) != -1) {
        text = text.replace(re, '');
    }
    return text;
}

Edit: Improved regex efficiency for non-match cases. (i.e. changed the "special" from [\S\s] to [*\/]).
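
As a usage sketch, the function can be run against the test data above. The function is repeated here (without the unneeded i flag) only so the snippet runs on its own; the variable names are illustrative.

```javascript
// Same function as above, repeated so this snippet is self-contained.
function strip_nested_C_comments(text) {
    // Matches an innermost comment: no "/*" or "*/" may appear inside it.
    var re = /\/\*[^*\/]*(?:(?!\/\*|\*\/)[*\/][^*\/]*)*\*\//;
    // Strip comments from the inside out until none are left.
    while (text.search(re) != -1) {
        text = text.replace(re, '');
    }
    return text;
}

var testData = [
    "/* one */",
    "Stuff one",
    "/* two /* three */ two */",
    "Stuff two",
    "/* four */"
].join("\n");

var stripped = strip_nested_C_comments(testData);

// Drop the now-empty lines to see what survived.
var keptLines = stripped.split("\n").filter(function (line) {
    return line.trim() !== "";
});
console.log(keptLines); // [ 'Stuff one', 'Stuff two' ]
```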

Comments

0

Regular expressions aren't good at dealing with nested values, since what you're describing is not a "regular language".

But regular expressions are greedy by default. That means the * and + quantifiers will do exactly what you're asking for:

var data = "/*asdasdasd /* asdasdsa */ qweqweqwe */";
data = data.replace( /\/\*.*\*\//, '' );
alert( 'Data: ' + data );

1 Comment

Does not work if there are multiple comments with stuff you want to keep between them.
0

I'm guessing that you're really after something that will remove or process properly nested comments from a string, even if there's more than one -- the answers giving 'greedy' regexes will go from the first /* to the last */: in strings like keep /* comment */ keep /* comment */ keep they will treat the middle keep as part of the comment.

The short answer is that JavaScript RegExps aren't powerful enough to do that; you need recursive patterns. (This is also known as "regexps can't count".)

But, if you just want to remove the comments, you can use a loop and remove the innermost ones first (using the non-greedy RegExp from @mVChr, modified to match the last possible starting delimiter instead of the first):

var re = /(.*)\/\*.*?\*\//;
while (re.test(string)) string = string.replace(re, '$1'); // replace() returns a new string, so assign it back

This moves the counting (of nesting levels) out of the regexp and into the loop, so to speak. (I didn't put a g flag on the regexp because I'm unsure of the side effects when using such a regexp in two places in a loop. And the loop takes care of finding all occurrences anyway.)
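
Here is a minimal sketch of that loop wrapped in a function, with a worked example. stripInnermostFirst is a hypothetical name; the sketch assumes single-line input (the dot does not match line breaks) and relies on replace() returning a new string, so the result has to be assigned back on every pass.

```javascript
// Hypothetical wrapper around the loop described above.
function stripInnermostFirst(input) {
    // (.*) is greedy, so the "/*" that follows it is the last one
    // that still has a "*/" after it; .*? then stops at the nearest "*/".
    const re = /(.*)\/\*.*?\*\//;
    let text = input;
    while (re.test(text)) {
        // Keep everything before the comment ($1) and drop the comment itself.
        text = text.replace(re, "$1");
    }
    return text;
}

const looped = stripInnermostFirst("keep /* a */ keep /* b /* c */ b */ keep");
console.log(looped); // "keep  keep  keep"
```

Each pass removes one comment, starting from the rightmost innermost one, so the number of passes equals the number of comments in the input.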

5 Comments

You've got the right idea, but unfortunately your regex does not quite correctly match the "innermost" comment. See my answer for a regex that will (it turns out this is not so simple to do!)
@ridgerunner: You are quite right, \/\*.*?\*\/ will match the outer /* instead of the inner. It's perfect for matching C89 comments, though, if you add a multiline flag. I'm fixing the answer to make the /* match as late as possible instead -- then it even works with greedy matching, still with a simple .* between the delimiter.
Well, not quite, non-greedy matching is still needed.
As I said, the regex to match an "innermost" comment is non-trivial. See my answer for one that does work correctly. It also implements Friedl's "unrolling-the-loop" construct for speed.
@ridgerunner: Do you have a concrete example where (the non-captured part of) my regexp doesn't match the rightmost innermost comment? (Ignoring multiline issues).
