PHP regular expression optimization

Question

I am trying to optimize a PHP regular expression and am seeking guidance from the wonderful Stack Overflow community.

I am attempting to catch pre-defined matches in an HTML block such as:

##test##

##!test2##

##test3|id=5##

An example text that would run is:

Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.

I have two options so far. Thoughts on which is best from an optimization standpoint?

Option 1

~##(!?)(test|test2|test3)(|\S+?)##~s

Option 2

~\##(\S+)##~s

For the "!" in example ##!test2##, it is meant to flag an item for a special behavior while being processed. This could be moved to be an attribute like ##test3|force=true&id=5##. If this is the case, there'd be:

Option 3

~##(test|test2|test3)(|\S+?)##~s
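If the flag moves into the attribute string as described for Option 3, the part after the pipe can be read with PHP's built-in parse_str(). This is only a sketch: parseTag() is a hypothetical helper, and it assumes the attribute string always uses query-string syntax (key=value pairs joined by &).

```php
<?php
// Hypothetical helper: split a tag body such as "test3|force=true&id=5"
// into a name and an attribute array.
function parseTag(string $body): array
{
    // A limit of 2 keeps any further pipes inside the attribute string.
    $parts = explode('|', $body, 2);
    $attributes = [];
    if (isset($parts[1])) {
        parse_str($parts[1], $attributes); // all values come back as strings
    }
    return ['name' => $parts[0], 'attributes' => $attributes];
}

$tag = parseTag('test3|force=true&id=5');
// The flag is the string "true", so compare explicitly instead of casting.
$force = ($tag['attributes']['force'] ?? 'false') === 'true';
var_dump($tag['name'], $force, $tag['attributes']['id']);
```

With this layout the regex no longer needs the (!?) group, and new flags can be added without touching the pattern.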

The biggest factor that we are looking at is performance and optimization.

Thanks in advance for your help and insight!

But how do you benchmark and determine which is best? Run them and compare memory usage and execution time.
– Andreas, Commented Dec 9, 2017 at 18:41
I agree with Andreas; the only way is to run a large number of tests (10,000+), then measure and compare your results.
– Mauricio Florez, Commented Dec 9, 2017 at 18:53
The preceding comments are correct, but you're missing other major problems. You need to escape the pipe symbol (|), as in (\|?). You do not need to escape a hash symbol (#). Also, it's not entirely clear what your parameters are for what the regex should match. But the simplest and probably fastest regex for what you're trying to do is probably going to look like this: ~##[^\s#]+?##~s.
– elixenide, Commented Dec 9, 2017 at 20:29
Avoid alternations as much as possible, since the engine has to try each branch to find a matching path. The best case is a match on the first branch. Fewer patterns usually mean more efficiency. Apply modifiers only when needed: s affects ., which you didn't even use. Be greedy where possible; the engine likes it. ~##[^#]*##~
– revo, Commented Dec 9, 2017 at 20:59
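To see what the two patterns suggested in the comments above actually capture, they can be run against a short sample. This is an illustrative sketch only; the sample text and the array labels are made up for the example, and the s modifier is dropped because neither pattern uses a dot.

```php
<?php
// Sample text containing the three tag shapes from the question.
$text = 'Lorem ##test## ipsum ##!test2## dolor ##test3|id=5## sit amet.';

// Patterns taken from the comments: no alternation, no capture groups.
$patterns = [
    'lazy, no whitespace'  => '~##[^\s#]+?##~',
    'greedy, any non-hash' => '~##[^#]*##~',
];

$found = [];
foreach ($patterns as $label => $pattern) {
    preg_match_all($pattern, $text, $m);
    $found[$label] = $m[0]; // full matches only
    echo $label, ': ', implode(', ', $m[0]), "\n";
}
```

Both patterns find the same three tags in this sample. They differ on edge cases: [^#]* also accepts whitespace inside the tag and an empty body (####), while [^\s#]+? requires at least one non-space character.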

Jan · Accepted Answer · 2017-12-09 21:23:33Z


As others have mentioned, you'll need to time your expressions. Python has the fantastic timeit module while for PHP you need to come up with your own solution:

<?php

$string = <<<DATA
Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus vitae pharetra.
DATA;

function timeit($regex, $string, $number) {
    $start = microtime(true);

    for($i=0;$i<$number;$i++) {
        preg_match_all($regex, $string, $matches);
    }

    return microtime(true) - $start;
}

$expressions = ['~##(!?)(test|test2|test3)(|\S+?)##~s', '~\##(\S+)##~s', '~##(test|test2|test3)(|\S+?)##~s'];
$cnt = 1;
foreach ($expressions as $expression) {
    echo "Expression " . $cnt . " took " . timeit($expression, $string, 10**5) . "\n";
    $cnt++;
}
?>

Running this on my computer (100k iterations each) yields the following times, in seconds:

Expression 1 took 0.45759010314941
Expression 2 took 0.34269499778748
Expression 3 took 0.40994691848755

Obviously, you can play around with other strings and more iterations but this will give you a general idea.
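Single timings like these can be noisy. One possible refinement, sketched here under assumptions, is to repeat each measurement several times and keep the fastest run. bestOf() is a hypothetical helper, the subject string and candidate list are placeholders, and hrtime() requires PHP 7.3 or later.

```php
<?php
// Sketch: repeat each measurement and keep the fastest run to reduce noise.
function bestOf(string $regex, string $subject, int $iterations, int $runs = 5): float
{
    $best = INF;
    for ($r = 0; $r < $runs; $r++) {
        $start = hrtime(true); // monotonic clock, in nanoseconds
        for ($i = 0; $i < $iterations; $i++) {
            preg_match_all($regex, $subject, $matches);
        }
        // Convert nanoseconds to seconds and keep the smallest value seen.
        $best = min($best, (hrtime(true) - $start) / 1e9);
    }
    return $best;
}

$subject = 'Lorem ##test## ipsum ##test3|id=5## dolor ##!test2## sit amet.';
$candidates = ['~##(!?)(test|test2|test3)(|\S+?)##~s', '~##[^#]*##~'];

$timings = [];
foreach ($candidates as $regex) {
    $timings[$regex] = bestOf($regex, $subject, 10000);
    printf("%-45s %.4f s\n", $regex, $timings[$regex]);
}
```

The minimum is used rather than the mean because background load can only make a run slower, never faster. Absolute numbers will still vary by machine and PHP version.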

answered Dec 9, 2017 at 21:23


2 Comments

MrC Over a year ago

Thank you! This benchmarking test script was very, very helpful.

Jan Over a year ago

@MrC: If it helped you, you may upvote/accept it as an answer (green tick on the left).

mickmackusa · Accepted Answer · 2017-12-12 13:00:23Z

If you need to dissect and process your matching substrings based on character occurrences, it seems most logical to separate the components during the regex step -- concern yourself with pattern optimization after accuracy and ease of handling are ironed out.

My pattern contains three capture groups; only the middle one requires a positive-length string. Negated character classes are used for pattern efficiency. I make the assumption that your substrings will not contain #, which is used to delimit the substrings. If they may contain #, then please update your question and I'll update my answer.

Pattern Demo

Pattern Explanation:

/          // pattern delimiter
##         // match leading substring delimiter
(!)?       // optionally capture: an exclamation mark
([^#|]+)   // greedily capture: one or more non-hash, non-pipe characters
\|?        // optionally match: a pipe
([^#]+)?   // optionally capture: one or more non-hash characters
##         // match trailing substring delimiter
/          // pattern delimiter

Code: (Demo)

$string='Lorem ipsum dolor sit amet, ##test## consectetur adipiscing elit. Pellentesque id congue massa. Curabitur ##test3|id=5## egestas ullamcorper sollicitudin. Mauris venenatis sed metus ##!test2## vitae pharetra.';

$result=preg_replace_callback(
    '/##(!)?([^#|]+)\|?([^#]+)?##/',
    function($m){
        echo '$m = ';
        var_export($m);
        echo "\n";
        // execute custom processing:
        if(isset($m[1][0])){  //check first character of element (element will always be set because $m[2] will always be set)
            echo "exclamation found\n";
        }
        // $m[2] is required (will always be set)
        if(isset($m[3])){  // will only be set if there is a positive-length string in it
            echo "post-pipe substring found\n";
        }
        echo "\n---\n";
        return '[some replacement text]';
    },$string);

var_export($result);

Output:

$m = array (
  0 => '##test##',
  1 => '',
  2 => 'test',
)

---
$m = array (
  0 => '##test3|id=5##',
  1 => '',
  2 => 'test3',
  3 => 'id=5',
)
post-pipe substring found

---
$m = array (
  0 => '##!test2##',
  1 => '!',
  2 => 'test2',
)
exclamation found

---
'Lorem ipsum dolor sit amet, [some replacement text] consectetur adipiscing elit. Pellentesque id congue massa. Curabitur [some replacement text] egestas ullamcorper sollicitudin. Mauris venenatis sed metus [some replacement text] vitae pharetra.'

If you are performing custom replacement processes, this method will "optimize" your string handling.
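When the tags only need to be collected rather than replaced, the same pattern can be used with preg_match_all(). The following is a minimal sketch, not part of the original answer: extractTags() and the array keys flagged, name and args are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: collect every tag in $text as a small associative array.
function extractTags(string $text): array
{
    // Same pattern as above; PREG_SET_ORDER yields one array per match.
    preg_match_all('/##(!)?([^#|]+)\|?([^#]+)?##/', $text, $matches, PREG_SET_ORDER);

    $tags = [];
    foreach ($matches as $m) {
        $tags[] = [
            'flagged' => ($m[1] ?? '') === '!',                // leading "!" present?
            'name'    => $m[2],                                // always captured
            'args'    => ($m[3] ?? '') !== '' ? $m[3] : null,  // text after the pipe
        ];
    }
    return $tags;
}

$tags = extractTags('a ##test## b ##!test2## c ##test3|id=5## d');
print_r($tags);
```

The ?? fallbacks guard against the optional groups being absent from a match array, so the loop does not rely on how unmatched trailing groups are reported.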
