I'm trying to find a certain string that can occur inside a comment block. That string can be a whole word, but it can also be part of a word. For instance, if I'm looking for "codex", it should be replaced with "bindex" even when it is part of a word: "codexing" should become "bindexing".

The catch is that this should only happen when the word is inside a comment block.

/* Lorem ipsum dolor sit amet, codex consectetur adipiscing elit. */

This word --> codex should not be replaced

/* Lorem ipsum dolor sit 
 * amet, codex consectetur 
 * adipiscing elit. 
 */

/** Lorem ipsum dolor sit 
 * amet, codex consectetur 
 * adipiscing elit. 
 */

// Lorem ipsum dolor sit amet, codex consectetur adipiscing elit.

# Lorem ipsum dolor sit amet, codex consectetur adipiscing elit.

------------------- Below "codex" is part of a word -------------------

/* Lorem ipsum dolor sit amet, somecodex consectetur adipiscing elit. */

/* Lorem ipsum dolor sit 
 * amet, codexing consectetur 
 * adipiscing elit. 
 */

And here also, this word --> codex should not be replaced

/** Lorem ipsum dolor sit 
 * amet, testcodexing consectetur 
 * adipiscing elit. 
 */

// Lorem ipsum dolor sit amet, __codex consectetur adipiscing elit.

# Lorem ipsum dolor sit amet, codex__ consectetur adipiscing elit.

What I have so far is this code:

$text = preg_replace ( '~(\/\/|#|\/\*).*?(codex).*?~', '$1 bindex', $text);

As you can see, this isn't really working the way I'd like. It doesn't replace the word when it's inside a multiline /* */ comment block, and sometimes it also removes all the text that came before the word "codex".

How can I improve my regex so that it meets my requirements?

5 Answers

3

Since you're dealing with multi-line text here, you should use the s modifier (DOTALL) so that . also matches newlines and the pattern can span multiple lines. Also, the forward slash doesn't need to be escaped, because the pattern uses ~ as its delimiter.

Try this code:

$text = preg_replace ( '~(//|#|/\*).*?(codex).*?~s', '$1 bindex', $text );
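To illustrate only the effect of the s modifier, here is a minimal sketch. The sample string and the variable names are made up for the demonstration and are not part of the answer's code:

```php
<?php
// Hypothetical sample: a two-line block comment.
$comment = "/* first line\n * codex here */";

// Without the s modifier, . stops at the newline, so "codex" is never reached.
$withoutS = preg_match('~/\*.*?codex~', $comment);

// With the s modifier, . also matches newlines and the pattern can reach "codex".
$withS = preg_match('~/\*.*?codex~s', $comment);

var_dump($withoutS, $withS);
```

preg_match returns 1 when the pattern matches and 0 when it does not, so the two results differ only because of the modifier.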

1 Comment

Thanks a lot, this seems to be doing exactly what I want! :)
2
$text = preg_replace ( '~(//|#|/\*)(.*?)(codex).*?~s', '$1$2bindex', $text );

Unlike the answer from anubhava, this does not delete the comment text that precedes 'codex'.

1

[EDIT] I edited this answer because, despite my naïve persistence at the time, I have to admit that it isn't possible to solve this problem with a simple or complicated preg_replace! Sorry to the good soul who upvoted my answer.[/EDIT]

To answer the question: it's not possible to improve your pattern, and it's not possible to do this with preg_replace at all! You have to build a pattern for preg_replace_callback that matches a whole comment, and then replace the occurrences of codex in the callback function.

This version can deal with any of these comment types and will not fail on strings such as /**/ codex /**/ or /*xxxx codex codex xxxx*/, or on other similar traps.

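$result = preg_replace_callback('~/\*.*?\*/|#\N+|//\N+~s', function ($m) {
    // $m[0] holds one whole comment; replace every occurrence inside it.
    // str_ireplace is case-insensitive; use str_replace for an exact-case match.
    return str_ireplace('codex', 'bindex', $m[0]);
}, $subject);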

Note that besides being simpler, this pattern is efficient too: each branch of the alternation is "anchored" because it starts with a literal character. The pattern therefore benefits from automatic optimizations.
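As a rough illustration, the same pattern can be tried on a few lines modeled on the question's samples. The function name replaceInComments and the input string are hypothetical, and the sketch assumes the case-insensitive str_ireplace is the intended replacement function:

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function replaceInComments(string $subject): string
{
    return preg_replace_callback(
        '~/\*.*?\*/|#\N+|//\N+~s',
        function (array $m): string {
            // $m[0] is one whole comment; every occurrence inside it is replaced.
            return str_ireplace('codex', 'bindex', $m[0]);
        },
        $subject
    );
}

// Made-up input: a block comment, plain text, and two line comments.
$input = "/* a codex, codexing */\ncodex stays\n// __codex\n# codex__";
$result = replaceInComments($input);
echo $result, "\n";

// The text between two block comments is left alone as well.
$trap = replaceInComments("/**/ codex /**/");
echo $trap, "\n";
```

The occurrence on the "codex stays" line is outside every comment, so it is never passed to the callback and is left unchanged.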

0

As was written hundreds, thousands or maybe even millions of times before in different comments, Regular Expressions are NOT for parsing code, or searching for errors in one.

Consider these examples:

// code to be replaced
var a = "/*code to be replaced*/";

/* code to be replaced
var b = "*/code to be replaced"; */

There is no way for you to parse the code (and yes, finding out if a string is inside a comment block is called parsing) with REGEX.

Find a parser library, or write a minimal one of your own. If you do write one, remember all the different use cases of the script, and in particular how string literals will affect the result.
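A very reduced scanner of that kind might look like the sketch below. It is only an assumption about what such a hand-written helper could be, not an existing library: the function name replaceInCommentsOnly is made up, it only knows single- and double-quoted strings with backslash escapes, and it would be confused by things like an apostrophe in plain code text, heredocs, or regex literals.

```php
<?php
// Hypothetical helper: walks the source once, tracking strings and comments.
function replaceInCommentsOnly(string $src, string $from, string $to): string
{
    $out = '';
    $i = 0;
    $n = strlen($src);
    while ($i < $n) {
        $c = $src[$i];
        $two = substr($src, $i, 2);
        if ($c === '"' || $c === "'") {
            // String literal: copy verbatim up to the matching unescaped quote.
            $j = $i + 1;
            while ($j < $n && $src[$j] !== $c) {
                $j += ($src[$j] === '\\') ? 2 : 1;
            }
            $j = min($j + 1, $n);
            $out .= substr($src, $i, $j - $i);
        } elseif ($two === '/*') {
            // Block comment: runs to the first closing marker, or to the end of input.
            $end = strpos($src, '*/', $i + 2);
            $j = ($end === false) ? $n : $end + 2;
            $out .= str_replace($from, $to, substr($src, $i, $j - $i));
        } elseif ($two === '//' || $c === '#') {
            // Line comment: runs to the end of the current line.
            $end = strpos($src, "\n", $i);
            $j = ($end === false) ? $n : $end;
            $out .= str_replace($from, $to, substr($src, $i, $j - $i));
        } else {
            // Ordinary code: copy one character unchanged.
            $out .= $c;
            $j = $i + 1;
        }
        $i = $j;
    }
    return $out;
}

// Made-up input: the middle line keeps its text because it is a string literal.
$code = "// codex one\nvar a = \"/*codex two*/\";\n/* codex three */";
$fixed = replaceInCommentsOnly($code, 'codex', 'bindex');
echo $fixed, "\n";
```

The point of the design is that string literals are consumed before comment markers are looked for, so a comment marker inside a string never starts a comment.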

1 Comment

I am not parsing code, I'm searching for a string that is preceded by (/*|//|#). That is nothing a modern regex engine can't do, as one of the given answers clearly proves. This is not HTML or XML that I'm trying to parse, or anything along those lines.
0

Something like this, using subgroups, should work:

$str = preg_replace(
    '~(<!--[a-zA-Z0-9 \n]*)(MYWORD)([a-zA-Z0-9 \n]*-->)~s',
    '$1$3',
     $input
);

You will just need to create a separate rule for each type of comment, and limit the possible characters allowed inside the comment with a character class (You might prefer to use a negated character class).

3 Comments

That is for HTML comments and won't replace more than one codex per comment block. It doesn't cover line breaks either.
Call it again until it returns no matches.
If you want to allow line breaks in comments, you need to add '\n' to the character class. You should really do something like this with full code rather than trying to shortcut it with regex. By using .* you run the risk of the MYWORD in <!-- block1 -->MYWORD<!--block2 --> being identified and replaced.
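The "call it again until it returns no matches" suggestion from the comments could be sketched as follows. It relies on the optional count argument of preg_replace; the sample text and the variable names are assumptions made for the illustration:

```php
<?php
// Same idea as the answer's pattern: a restricted character class on both sides.
$pattern = '~(<!--[a-zA-Z0-9 \n]*)MYWORD([a-zA-Z0-9 \n]*-->)~';

// Made-up input: two occurrences inside a comment, one between two comments.
$text = "<!-- a MYWORD b MYWORD c -->MYWORD<!-- d -->";

// Each pass removes at most one occurrence per comment, so repeat until
// preg_replace reports that nothing was replaced.
do {
    $text = preg_replace($pattern, '$1$2', $text, -1, $count);
} while ($count > 0);

echo $text, "\n";
```

Because the character class cannot match the - or > of a closing marker, the MYWORD that sits between the two comments is never treated as being inside one.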
