I have been trying to capture code blocks in a similar fashion to wiki tags:

{{code:
      code goes here
   }}

Example code is shown below:

$strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
    if (ctype_alnum($testcase)) {
        echo "It is The string $testcase consists of all letters or digits.\n";
    } else {
        echo "The string $testcase does not consist of all letters or digits.\n";
    }
}

Essentially I want to capture anything between the {{..}}. There are multiple blocks like this embedded in an HTML page.

I would appreciate any help.

  • When you say you want to capture "anything between the {{..}}", do you mean you want to include or exclude the "code:" part? Commented Jun 18, 2011 at 18:11
  • What kind of "code" block is this? A programming language? Or something else? Commented Jun 18, 2011 at 18:22

3 Answers

5

Well, to start off, regex is not a good way to solve this problem. The right approach is to write a parser that understands the language's semantics and can tease out the subtleties. Having said that, if you still want a quick-and-dirty regex-based approach that will work 99.99% of the time but has a couple of acknowledged bugs (see the end of this answer), here you go:

You can use preg_match_all(). Here is a proof of concept:

// Sample page. Every $ is escaped so PHP does not interpolate variables
// inside the double-quoted string.
$input = "
<html>
    <head>
        <title>{{code:echo 'Hello World';}}</title>
    </head>
    <body>
        <h1>{{code:\$strings = array('AbCd1zyZ9', 'foo!#\$bar');
foreach (\$strings as \$testcase) {
    if (ctype_alnum(\$testcase)) {
        echo \"It is The string \$testcase consists of all letters or digits.\\n\";
    } else {
        echo \"The string \$testcase does not consist of all letters or digits.\\n\";
    }
}
}}</h1>
    </body>
</html>
";

$matches = array();
// [^\x00]*? lazily matches any character except NUL (newlines included)
// up to the first }}.
preg_match_all('/{{code:([^\x00]*?)}}/', $input, $matches);

// $matches[1] holds the text captured between {{code: and }}.
print_r($matches[1]);

Outputs the following:

Array
(
    [0] => echo 'Hello World';
    [1] => $strings = array('AbCd1zyZ9', 'foo!#$bar');
foreach ($strings as $testcase) {
    if (ctype_alnum($testcase)) {
        echo "It is The string $testcase consists of all letters or digits.\n";
    } else {
        echo "The string $testcase does not consist of all letters or digits.\n";
    }
}

)

Be careful. There are some edge case bugs involving early termination by encountering }} within a "code" block:

  1. If }} appears in a quoted string, the regex matches too early
  2. If } is the last character of your "code" block and it's immediately followed by }}, you'll lose the closing } from your code block.
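
Both edge cases are easy to reproduce. The snippet below is only a minimal sketch: the two sample inputs are made up for illustration, and the pattern is the same one used above.

```php
<?php
$pattern = '/{{code:([^\x00]*?)}}/';

// Edge case 1: }} inside a quoted string ends the match too early.
preg_match($pattern, "{{code:echo '}}';}}", $m1);

// Edge case 2: the code ends with } directly before the closing }}.
preg_match($pattern, '{{code:if ($x) { foo(); }}}', $m2);

var_dump($m1[1]); // string(6) "echo '"
var_dump($m2[1]); // string(17) "if ($x) { foo(); "
```

In the second case the final } of the code is left outside the match, so it would also stay behind in the page after a replacement.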

9 Comments

@Asaph Thanks! Will try it out. How can I use the result to, say, place it in a new <div>?
Just be careful with this. If the code block contains }} anywhere it will fail and end early. There's no great way to avoid this besides trying to parse the code (look for uneven numbers of quotes before }}, etc.)
@yannis: I'm not sure what exactly you mean. Perhaps if you post some code, what you are trying to do will become clearer. It won't fit in the comments though so please post a new question for that part. Thanks.
@Cyclotis04: We don't know what kind of "code" block this is. Don't assume it's a programming language. If this code is a programming language, then maybe }} could validly appear quoted within the "code". But the OP doesn't mention this so I didn't try to handle it.
@Cyclotis04: I learned the [^\x00] trick from Jeffrey Friedl's Mastering Regular Expressions book. It's a well supported trick. I highly recommend the book too.
2

As I've said in the comments, Asaph's answer is a good, solid regex, but it breaks down when }} is contained within the code block. Hopefully this won't be a problem, but since it is possible, it would be best to make your regex a little more expansive. If we can assume that any }} appearing between two single quotes does not signify the end of the code, as in Asaph's example of <div>{{code:$myvar = '}}';}}</div>, we can expand our regex a bit:

{{code:((?:[^']*?'[^']*?')*?[^']*?)}}

[^']*?' looks for a set of non-' characters, followed by a single quote, and [^']*?'[^']*?' looks for two of them in succession. This "swallows" strings like '}}'. We lazily look for any number of these strings, then the rest of any non-string code with [^']*?, and finally our ending }}.

This allows us to match the entire string {{code:$myvar = '}}';}} rather than just {{code:$myvar = '}}.
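
A quick way to check the difference in PHP, as a sketch: the variable names here are arbitrary, and the subject is written as a nowdoc so that $myvar and the quotes reach the regex untouched.

```php
<?php
// Nowdoc: no interpolation, no escape processing.
$subject = <<<'TXT'
<div>{{code:$myvar = '}}';}}</div>
TXT;

$simple     = '/{{code:([^\x00]*?)}}/';
$quoteAware = "/{{code:((?:[^']*?'[^']*?')*?[^']*?)}}/";

preg_match($simple, $subject, $a);
preg_match($quoteAware, $subject, $b);

echo $a[0], "\n"; // {{code:$myvar = '}}
echo $b[0], "\n"; // {{code:$myvar = '}}';}}
echo $b[1], "\n"; // $myvar = '}}';
```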

There are still problems with this method, however. Escaping a quote within a string, such as in {{code:$myvar = '\'}}\'';}} will not work, as we will "swallow" '\' first, and end with the }} immediately following. It may be possible to determine these escaped single-quotes as well, or to add in support for double-quoted strings, but you need to ask yourself at what point using a code-parser is a better idea.

See the entire Regex in action here. (If it doesn't match anything at first, just click the window.)


How can I use the result to, say, place it in a new <div>?

Use the replace function:

$output = preg_replace($expression, '<div>$0</div>', $input);

$0 inserts the entire match, {{code: and }} delimiters included, wrapped in a new <div> element. Alternatively, if you just want the actual source code, use $1, as we captured the source code in a separate capture group.
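
If the captured code is going back into an HTML page, it probably needs escaping first, which a plain replacement string cannot do. The sketch below assumes that is what you want; it uses preg_replace_callback() with htmlspecialchars(), the sample input is made up, and the simpler pattern is used only to keep the example short.

```php
<?php
$input = '<body>{{code:echo "<b>hi</b>";}}</body>';

$output = preg_replace_callback(
    '/{{code:([^\x00]*?)}}/',
    function (array $m): string {
        // $m[1] is the captured code; escape it so the browser shows it
        // as text instead of interpreting it as markup.
        return '<div>' . htmlspecialchars($m[1], ENT_QUOTES) . '</div>';
    },
    $input
);

echo $output, "\n";
// <body><div>echo &quot;&lt;b&gt;hi&lt;/b&gt;&quot;;</div></body>
```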

Again, see the replacement here.


I went deeper down the rabbit hole...

{{code:((?:(?:[^']|\\')*?(?<!\\)'(?:[^']|\\')*?(?<!\\)')*?(?:[^']|\\')*?)}}

This won't break with escaped single-quotes, and correctly matches {{code:$myvar = '\'}}\'';}}.
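
To try it from PHP without fighting string escaping, one option is to put both the pattern and the subject in nowdoc strings, which keep every backslash and quote exactly as typed. This is only a sketch; note that nested lazy quantifiers like these can backtrack heavily on large inputs.

```php
<?php
$pattern = <<<'RE'
/{{code:((?:(?:[^']|\\')*?(?<!\\)'(?:[^']|\\')*?(?<!\\)')*?(?:[^']|\\')*?)}}/
RE;

$subject = <<<'TXT'
<div>{{code:$myvar = '\'}}\'';}}</div>
TXT;

preg_match($pattern, $subject, $m);

echo $m[0], "\n"; // {{code:$myvar = '\'}}\'';}}
echo $m[1], "\n"; // $myvar = '\'}}\'';
```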

Ta-da.
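
For comparison, here is roughly what a small hand-written scanner could look like. It is a sketch under stated assumptions, not a PHP parser: extractCodeBlocks() is a hypothetical helper name, it only understands single- and double-quoted strings with backslash escapes, and it knows nothing about heredoc, nowdoc or comments.

```php
<?php
/**
 * Hypothetical helper: returns the bodies of all {{code: ... }} blocks,
 * ignoring any }} that sits inside a quoted string.
 */
function extractCodeBlocks(string $text): array
{
    $open   = '{{code:';
    $blocks = [];
    $len    = strlen($text);
    $pos    = 0;

    while (($start = strpos($text, $open, $pos)) !== false) {
        $bodyStart = $start + strlen($open);
        $quote     = null; // active quote character, or null outside a string
        $end       = null;

        for ($i = $bodyStart; $i < $len; $i++) {
            $ch = $text[$i];
            if ($quote !== null) {
                if ($ch === '\\') {
                    $i++;              // skip the escaped character
                } elseif ($ch === $quote) {
                    $quote = null;     // string closed
                }
            } elseif ($ch === "'" || $ch === '"') {
                $quote = $ch;          // string opened
            } elseif ($ch === '}' && $i + 1 < $len && $text[$i + 1] === '}') {
                $end = $i;             // first }} outside a string
                break;
            }
        }

        if ($end === null) {
            break; // unterminated block: stop scanning
        }
        $blocks[] = substr($text, $bodyStart, $end - $bodyStart);
        $pos = $end + 2;
    }

    return $blocks;
}

$page = <<<'TXT'
<div>{{code:$myvar = '\'}}\'';}}</div><p>{{code:echo "}}";}}</p>
TXT;

print_r(extractCodeBlocks($page));
```

This handles the quoted }} cases from above, but code that ends in } right before the closing }} is still cut one character short, exactly as with the regex.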

4 Comments

+1 for taking the regex one step further but also recognizing that there are still failing test cases and continuing to work through them with a purely regex based approach is asking for trouble.
PHP supports a variety of ways to quote strings: single quotes, double quotes, heredoc and nowdoc. This regex is just the tip of the iceberg...
@Asaph - It is indeed just the tip of the iceberg. Frankly, I'd almost rather write my own PHP parser than try and write that massive regex. To be honest I mostly did this to see what it would take, knowing regular languages can't really count anything arbitrarily.
Thanks for all the help. You are right: the best way is probably to write a short parser. My guess is that it will also perform faster.
0

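Use

preg_match_all("/{{(.*?)}}/s", $text, $match);

where $text is the text that might contain code. This captures anything between {{ and }} into $match[1]; the lazy .*? stops at the first }}, and the s modifier lets . match newlines.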
