PHP Regular expression to capture code

Question

I have been trying to capture code blocks in a similar fashion to wiki tags:

{{code:
      code goes here
   }}

Example code is shown below:

$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
    if (ctype_alnum($testcase)) {
        echo "It is The string $testcase consists of all letters or digits.\n";
    } else {
        echo "The string $testcase does not consist of all letters or digits.\n";
    }
}

Essentially I want to capture anything between the {{..}}. There are multiple blocks like this embedded in an HTML page.

I would appreciate any help.

When you say you want to capture "anything between the {{..}}", do you mean you want to include or exclude the "code:" part?
– Asaph, Commented Jun 18, 2011 at 18:11
What kind of "code" block is this? A programming language? Or something else?
– Asaph, Commented Jun 18, 2011 at 18:22

Asaph · Accepted Answer · 2011-06-18 18:42:00Z

5

Well, to start off, regex is not a good way to solve this problem. The right approach is to write a parser that understands the language's semantics and can tease out the subtleties. Having said that, if you still want a quick and dirty regex-based approach that works in the vast majority of cases but has a couple of acknowledged bugs (see the end of this answer), here you go:

You can use preg_match_all(). Here is a proof of concept:

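// A nowdoc keeps the sample literal: nothing needs escaping, and $bar and
// $testcase are not interpolated (in a double-quoted string any unescaped,
// undefined variable would silently expand to an empty string).
$input = <<<'HTML'
<html>
    <head>
        <title>{{code:echo 'Hello World';}}</title>
    </head>
    <body>
        <h1>{{code:$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
    if (ctype_alnum($testcase)) {
        echo "It is The string $testcase consists of all letters or digits.\n";
    } else {
        echo "The string $testcase does not consist of all letters or digits.\n";
    }
}
}}</h1>
    </body>
</html>
HTML;

$matches = array();
// [^\x00] matches any character except NUL, newlines included, so a block may
// span several lines. The lazy *? stops at the first }} it encounters.
preg_match_all('/{{code:([^\x00]*?)}}/', $input, $matches);

// $matches[1] holds the captured contents of every block that was found.
print_r($matches[1]);

Outputs the following:

Array
(
    [0] => echo 'Hello World';
    [1] => $strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
    if (ctype_alnum($testcase)) {
        echo "It is The string $testcase consists of all letters or digits.\n";
    } else {
        echo "The string $testcase does not consist of all letters or digits.\n";
    }
}

)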

Be careful. There are some edge case bugs involving early termination by encountering }} within a "code" block:

If }} appears in a quoted string, the regex matches too early.
If } is the last character of your "code" block and it's immediately followed by }}, you'll lose the closing } from your code block.
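
Both edge cases are easy to reproduce. The snippet below is a minimal sketch that reuses the pattern from the proof of concept; the two sample inputs are made up for illustration and do not come from the question.

```php
<?php
// Same pattern as in the proof of concept above.
$pattern = '/{{code:([^\x00]*?)}}/';

// Case 1: "}}" inside a quoted string ends the match too early.
preg_match($pattern, '{{code:$myvar = \'}}\';}}', $quoted);

// Case 2: the code itself ends in "}", so the block closes with "}}}".
// The lazy quantifier stops at the first "}}" and the code's own "}" is lost.
preg_match($pattern, '{{code:if ($a) { b(); }}}', $trailing);

echo $quoted[1], "\n";   // $myvar = '
echo $trailing[1], "\n"; // if ($a) { b(); (followed by a space, closing brace missing)
```

In both cases the capture is cut short silently, so it may be worth validating the extracted code before using it.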

edited Jun 18, 2011 at 18:42

answered Jun 18, 2011 at 18:08

Asaph

9 Comments

yannisl Over a year ago

@Asaph Thanks! Will try it out. How can I use the result to, say, place it in a new <div>?

dlras2 Over a year ago

Just be careful with this. If the code block contains }} anywhere it will fail and end early. There's no great way to avoid this besides trying to parse the code (look for uneven numbers of quotes before }}, etc.)

Asaph Over a year ago

@yannis: I'm not sure what exactly you mean. Perhaps if you post some code, what you are trying to do will become clearer. It won't fit in the comments though so please post a new question for that part. Thanks.

Asaph Over a year ago

@Cyclotis04: We don't know what kind of "code" block this is. Don't assume it's a programming language. If this code is a programming language, then maybe }} could validly appear quoted within the "code". But the OP doesn't mention this so I didn't try to handle it.

Asaph Over a year ago

@Cyclotis04: I learned the [^\x00] trick from Jeffrey Friedl's Mastering Regular Expressions book. It's a well supported trick. I highly recommend the book too.

Community · Accepted Answer · 2017-05-23 11:55:38Z

2

As I've said in the comments, Asaph's answer is a good solid regex, but it breaks down when }} is contained within the code block. Hopefully this won't be a problem, but since it is a possibility, it would be best to make your regex a little more expansive. If we can assume that any }} appearing between two single quotes does not signify the end of the code, as in Asaph's example of <div>{{code:$myvar = '}}';}}</div>, we can expand our regex a bit:

{{code:((?:[^']*?'[^']*?')*?[^']*?)}}

[^']*?' looks for a set of non-' characters, followed by a single quote, and [^']*?'[^']*?' looks for two of them in succession. This "swallows" strings like '}}'. We lazily look for any number of these strings, then the rest of any non-string code with [^']*?, and finally our ending }}.

This allows us to match the entire string {{code:$myvar = '}}';}} rather than just {{code:$myvar = '}}.
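
A quick way to check this is to run the expanded expression through preg_match(). This is a minimal sketch; the "/" delimiters are an assumption, since the expression above is shown without them.

```php
<?php
// The expanded expression from above, wrapped in "/" delimiters (assumed).
// Double quotes are used so the single quotes in the pattern need no escaping.
$pattern = "/{{code:((?:[^']*?'[^']*?')*?[^']*?)}}/";

$input = '<div>{{code:$myvar = \'}}\';}}</div>';

preg_match($pattern, $input, $m);

echo $m[0], "\n"; // {{code:$myvar = '}}';}}
echo $m[1], "\n"; // $myvar = '}}';
```

Here $m[0] is the whole block and $m[1] is the code without the {{code: and }} wrapper.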

There are still problems with this method, however. Escaping a quote within a string, such as in {{code:$myvar = '\'}}\'';}} will not work, as we will "swallow" '\' first, and end with the }} immediately following. It may be possible to determine these escaped single-quotes as well, or to add in support for double-quoted strings, but you need to ask yourself at what point using a code-parser is a better idea.

How can I use the result to, say, place it in a new <div>?

Use the replace function:

preg_replace($expression, "<div>$0</div>", $input)

$0 inserts the entire match, including the {{code: and }} delimiters, and wraps it in a new <div>. Alternatively, if you just want the actual source code, use $1, as we captured the source code in a separate capture group.
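
As a rough illustration, here is the $1 variant end to end. The sample input and the variable names $plain and $escaped are made up for this sketch, and escaping the code with htmlspecialchars() is an extra step that the answer does not ask for.

```php
<?php
// The single-quote-aware expression from above, with "/" delimiters (assumed).
$expression = "/{{code:((?:[^']*?'[^']*?')*?[^']*?)}}/";
$input = '<p>{{code:echo \'Hello World\';}}</p>';

// $1 is the first capture group: the code without the {{code: ... }} wrapper.
// Single quotes keep PHP from treating $1 as anything but literal text.
$plain = preg_replace($expression, '<div>$1</div>', $input);
echo $plain, "\n"; // <p><div>echo 'Hello World';</div></p>

// Optional extra: escape HTML-significant characters before embedding the code.
$escaped = preg_replace_callback($expression, function (array $m): string {
    return '<div>' . htmlspecialchars($m[1], ENT_QUOTES) . '</div>';
}, $input);
echo $escaped, "\n"; // <p><div>echo &#039;Hello World&#039;;</div></p>
```

The callback form is worth considering whenever the captured code may contain characters such as < or &, which would otherwise be interpreted as markup.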

I went deeper down the rabbit hole...

{{code:((?:(?:[^']|\\')*?(?<!\\)'(?:[^']|\\')*?(?<!\\)')*?(?:[^']|\\')*?)}}

This won't break with escaped single-quotes, and correctly matches {{code:$myvar = '\'}}\'';}}.
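
To try this one in PHP without fighting backslash escaping, the pattern can be placed in a nowdoc. This is a sketch only; the "/" delimiters and the nowdoc wrapping are assumptions, not part of the expression above.

```php
<?php
// A nowdoc keeps every backslash literal, so the pattern reads exactly as written above.
$pattern = <<<'REGEX'
/{{code:((?:(?:[^']|\\')*?(?<!\\)'(?:[^']|\\')*?(?<!\\)')*?(?:[^']|\\')*?)}}/
REGEX;

$input = <<<'TEXT'
<div>{{code:$myvar = '\'}}\'';}}</div>
TEXT;

preg_match($pattern, $input, $m);

echo $m[1], "\n"; // $myvar = '\'}}\'';
```

Writing the same pattern as an ordinary single-quoted PHP string would require doubling the backslashes and escaping each quote, which is easy to get wrong.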

Ta-da.

edited May 23, 2017 at 11:55

CommunityBot

answered Jun 19, 2011 at 2:26

dlras2

4 Comments

Asaph Over a year ago

+1 for taking the regex one step further but also recognizing that there are still failing test cases and continuing to work through them with a purely regex based approach is asking for trouble.

Asaph Over a year ago

PHP supports a variety of ways to quote strings: single quotes, double quotes, heredoc and nowdoc. This regex is just the tip of the iceberg...

dlras2 Over a year ago

@Asaph - It is indeed just the tip of the iceberg. Frankly, I'd almost rather write my own PHP parser than try and write that massive regex. To be honest I mostly did this to see what it would take, knowing regular languages can't really count anything arbitrarily.

yannisl Over a year ago

Thanks for all the help. You are right, the best way is probably to write a short parser; my guess is it will also perform faster.
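
For reference, a short parser along those lines might look like the sketch below. It is not code from this thread: the function name extractCodeBlocks is hypothetical, and it assumes that strings are delimited by single or double quotes with backslash escapes, and that in a run such as }}} only the last two braces close the block.

```php
<?php
/**
 * Hypothetical helper: extract {{code: ... }} blocks with a hand-written
 * scanner that skips over single- and double-quoted strings.
 */
function extractCodeBlocks(string $input): array
{
    $blocks = [];
    $open = '{{code:';
    $offset = 0;
    $length = strlen($input);

    while (($start = strpos($input, $open, $offset)) !== false) {
        $bodyStart = $start + strlen($open);
        $quote = null; // the quote character we are currently inside, if any
        $end = null;   // index of the first brace of the closing "}}"

        for ($i = $bodyStart; $i < $length; $i++) {
            $ch = $input[$i];
            if ($quote !== null) {
                if ($ch === '\\') {
                    $i++;          // skip the escaped character
                } elseif ($ch === $quote) {
                    $quote = null; // string closed
                }
            } elseif ($ch === "'" || $ch === '"') {
                $quote = $ch;      // string opened
            } elseif ($ch === '}' && $i + 1 < $length && $input[$i + 1] === '}') {
                // In a run like "}}}", keep the leading braces as part of the code.
                while ($i + 2 < $length && $input[$i + 2] === '}') {
                    $i++;
                }
                $end = $i;
                break;
            }
        }

        if ($end === null) {
            break; // unterminated block: stop scanning
        }
        $blocks[] = substr($input, $bodyStart, $end - $bodyStart);
        $offset = $end + 2;
    }

    return $blocks;
}

$page = <<<'TEXT'
<p>{{code:$a = '}}';}}</p>
<p>{{code:if ($a) { b(); }}}</p>
<p>{{code:echo "\"}}";}}</p>
TEXT;

print_r(extractCodeBlocks($page));
// [0] => $a = '}}';
// [1] => if ($a) { b(); }
// [2] => echo "\"}}";
```

This still ignores heredoc, nowdoc, and comments (an apostrophe inside a comment would be mistaken for the start of a string), so it is only a starting point for a real parser.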

lovesh · Accepted Answer · 2011-06-18 18:11:34Z

0

Use

preg_match_all("/{{(.)*}}/", $text, $match)

where $text is the text that might contain code. This captures anything between {{ and }}.
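
One caveat is worth checking before relying on this pattern: (.)* is greedy and repeats a one-character group, so with several blocks on a line it spans from the first {{ to the last }} and the group keeps only the final character. The sketch below shows that behaviour on a made-up input, followed by one possible adjustment (a lazy group plus the s modifier so that . also matches newlines); the adjusted pattern is a suggestion and not part of this answer.

```php
<?php
$text = '<b>{{code:a();}}</b> and <i>{{code:b();}}</i>';

// The pattern as posted: greedy, with the capture group repeated per character.
preg_match_all("/{{(.)*}}/", $text, $match);
echo $match[0][0], "\n"; // {{code:a();}}</b> and <i>{{code:b();}}
echo $match[1][0], "\n"; // ;

// A possible adjustment: lazy quantifier inside the group, "s" so "." matches newlines.
preg_match_all('/{{(.*?)}}/s', $text, $fixed);
print_r($fixed[1]);
// [0] => code:a();
// [1] => code:b();
```

The adjusted version still shares the early-termination problems described for the accepted answer's regex.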

answered Jun 18, 2011 at 18:11

lovesh
