PHP Regexp Optimize An Existing Pattern

Question

I'm using this code

preg_match_all("/([^#]+\btbds\b.+?)#/iu", $data, $matches);

to find all words named tbds, but it's taking around 1.20 seconds to perform the pattern search. If I use tbds\b instead of \btbds\b, it takes just 0.19 seconds (about 6 times faster).

preg_match_all("/([^#]+tbds\b.+?)#/iu", $data, $matches);

Is there any way to optimize the word match \btbds\b so that it takes around 0.19 seconds? I need to process a large amount of data.

Here is the test code:

function generateRandomString($length = 10) {
    $characters = ' 0123 456 789 abcd efgh ijkl mn opqrstu vwx yzAB CDE FGHI JKL MNOP QRS TUVWX YZ';
    $charactersLength = strlen($characters);
    $randomString = '';
    for ($i = 0; $i < $length; $i++) {
        $randomString .= $characters[rand(0, $charactersLength - 1)];
    }
    $randomString = preg_replace('/\s+/', ' ', $randomString);
    return trim($randomString,' ');
}


$data=NULL;
for ($a = 1; $a < 1000000; $a++) 
    $data.=" ".generateRandomString(100)." #";


$t = microtime(true);
preg_match_all("/([^#]+\btbds\b.+?)#/iu", $data, $matches); 
echo microtime(true) - $t; echo "\n";
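
To compare candidate patterns on equal footing, a small timing helper can be handy. The sketch below is an assumption, not part of the original post: `timePattern` and the sample string are hypothetical, and the sample is kept tiny so the match counts can be checked by hand.

```php
<?php
// Hypothetical helper: runs one pattern over $data and reports
// the number of matches plus the elapsed wall-clock time.
function timePattern(string $pattern, string $data): array
{
    $start = microtime(true);
    $count = preg_match_all($pattern, $data, $matches);
    $elapsed = microtime(true) - $start;

    if ($count === false) {
        // preg_match_all() returns false on failure (e.g. backtrack limit hit).
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    return ['count' => $count, 'seconds' => $elapsed];
}

// Tiny hand-checkable sample; swap in the generated $data for real timings.
$sample = ' foo tbds bar # atbds baz # x tbds #';

$withBoundary    = timePattern('/([^#]+\btbds\b.+?)#/iu', $sample);
$withoutBoundary = timePattern('/([^#]+tbds\b.+?)#/iu', $sample);

printf("with \\b: %d matches in %.6f s\n", $withBoundary['count'], $withBoundary['seconds']);
printf("without \\b: %d matches in %.6f s\n", $withoutBoundary['count'], $withoutBoundary['seconds']);
```

On this sample the two patterns return different match counts, because the version without the leading \b also accepts the record containing atbds. Any timing comparison between them is therefore not a comparison of equivalent work.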

I need to process a large amount of data. Any help is welcomed :)
– Miguel, Commented Mar 2, 2018 at 15:14
So, what do you want to do to this large amount of data, and why must it be done very quickly? Perhaps it could motivate us to help you, if we knew what you were trying to achieve? In other words: How can we optimize something we know nothing about?
– KIKO Software, Commented Mar 2, 2018 at 15:18
Note that they don't match the same, as the one without \b will happily match atbds # while the other one will not
– Sebastian Proske, Commented Mar 2, 2018 at 15:19
Might have a try with another approach: $res = preg_grep('/\btbds\b/i', explode("#", $data));
– bobble bubble, Commented Mar 2, 2018 at 19:55
@Miguel perhaps you could help us to understand your real input data because what you are generating randomly will never find a match -- there are no # symbols generated. If you want us to design an optimized pattern for you, we need to fully understand the input variability. Furthermore, you are performing unicode matching, but there are no unicode characters on offer. Please improve your question.
– mickmackusa ♦, Commented Mar 5, 2018 at 3:09
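
The preg_grep approach suggested in the comments can be sketched as follows. The sample string, the `trim` step, and the reindexing are assumptions added for illustration. Whether this is faster on the real data would need to be measured.

```php
<?php
// Sample input (hypothetical): '#'-terminated records.
$data = ' foo tbds bar # atbds baz # x tbds #';

// Split once on '#', then test each record with a short pattern
// that no longer needs [^#]+ in front of the word.
$records = explode('#', $data);
$hits = preg_grep('/\btbds\b/i', $records);

// preg_grep() preserves the original array keys, so reindex for a clean list.
$hits = array_values(array_map('trim', $hits));

print_r($hits);
```

Results may differ from the original pattern in edge cases. For example, a record consisting only of tbds is accepted here, while the original pattern requires at least one character on each side of the word.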

frosti · Accepted Answer · 2018-03-02 15:28:43Z

1

What makes your regex slow is the leading [^#]+.

It may help to define a starting point, which can be either # or the start of the string, like this:

/(?:(?<=#)|^)([^#]*\btbds\b.+?)#/iu
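
A minimal usage sketch of this pattern, assuming a small hand-written sample string (not from the original post) and an added `trim` for readability:

```php
<?php
// Hypothetical sample: three '#'-terminated records, one of which contains "atbds".
$data = ' foo tbds bar # atbds baz # x tbds #';

// A match attempt can only succeed at the start of the string or right after a '#'.
$pattern = '/(?:(?<=#)|^)([^#]*\btbds\b.+?)#/iu';

preg_match_all($pattern, $data, $matches);

// Group 1 holds the record text without the trailing '#'.
$found = array_map('trim', $matches[1]);

print_r($found);
```

The record containing atbds is skipped because of the leading \b, the same as with the pattern in the question.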


answered Mar 2, 2018 at 15:28


3 Comments

Miguel Over a year ago

I just tested your answer and it's taking around 1.20 seconds.

frosti Over a year ago

I tested it at regex101, where it required about 1/10 of the steps, and with your benchmark on a PHP testing site, where it also seemed considerably faster :)

Miguel Over a year ago

Hi frosti, your regex /(?:(?<=#)|^)([^#]*\btbds\b.+?)#/iu performs similarly to my original regex /([^#]+\btbds\b.+?)#/iu. I'm trying to get performance similar to /([^#]+tbds\b.+?)#/iu, around 0.19 seconds (6 times faster).

The fourth bird · Accepted Answer · 2018-03-02 18:33:23Z

1

One possibility is to match # and then use \K to reset the starting point of the reported match.

Then match one or more characters that are not # with [^#]+, followed by tbds between word boundaries, \btbds\b, and finally the rest of the record up to the closing # with [^#]+#.

#\K[^#]+\btbds\b[^#]+#
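
A small sketch of how this pattern behaves with preg_match_all. The delimiters and the iu flags are carried over from the question, and the sample string is hypothetical. The second pattern is an assumed variant, not part of the answer: it swaps the final # for a lookahead so the closing # of one record can also open the next.

```php
<?php
// Two adjacent matching records (hypothetical sample).
$data = '# a tbds b # c tbds d #';

// The pattern from the answer, wrapped in delimiters with the question's flags.
$consuming = '/#\K[^#]+\btbds\b[^#]+#/iu';

// Assumed variant: the lookahead checks for the closing '#' without consuming it,
// so the same '#' remains available as the opening '#' of the next record.
$lookahead = '/#\K[^#]+\btbds\b[^#]+(?=#)/iu';

$consumingCount = preg_match_all($consuming, $data, $m1);
$lookaheadCount = preg_match_all($lookahead, $data, $m2);

printf("consuming: %d match(es)\n", $consumingCount);
printf("lookahead: %d match(es)\n", $lookaheadCount);
print_r($m2[0]);
```

With the consuming form, the second record is missed on this input because its opening # was already used to close the first match. Both forms also need a # before the first record, so a record at the very start of the string would not be found.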

edited Mar 2, 2018 at 18:33

answered Mar 2, 2018 at 17:38


3 Comments

Miguel Over a year ago

Thanks for the answer. I just tested your suggestion and it seems to perform similarly to the original /([^#]+\btbds\b.+?)#/iu.

The fourth bird Over a year ago

@Miguel Did you compare the number of steps on regex101 for a few examples?

Miguel Over a year ago

Yes, it takes very few steps compared to the original, but when running the test code I get a similar execution time, around 1.20 seconds.
