
I'm using this code:

preg_match_all("/([^#]+\btbds\b.+?)#/iu", $data, $matches);   

to find every occurrence of the word tbds, but it takes around 1.20 seconds to perform the pattern search. If I just use tbds\b instead of \btbds\b, it takes only 0.19 seconds (about 6 times faster):

preg_match_all("/([^#]+tbds\b.+?)#/iu", $data, $matches); 

Is there any way to optimize the word match \btbds\b so that it takes around 0.19 seconds? I need to process a large amount of data.

Here is the test code:

function generateRandomString($length = 10) {
    $characters = ' 0123 456 789 abcd efgh ijkl mn opqrstu vwx yzAB CDE FGHI JKL MNOP QRS TUVWX YZ';
    $charactersLength = strlen($characters);
    $randomString = '';
    for ($i = 0; $i < $length; $i++) {
        $randomString .= $characters[rand(0, $charactersLength - 1)];
    }
    $randomString = preg_replace('/\s+/', ' ', $randomString);
    return trim($randomString,' ');
}


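// Build the test input: 999,999 random records, each terminated by '#'.
$data = '';
for ($a = 1; $a < 1000000; $a++) {
    $data .= " " . generateRandomString(100) . " #";
}

// Time the pattern search.
$t = microtime(true);
preg_match_all("/([^#]+\btbds\b.+?)#/iu", $data, $matches);
echo microtime(true) - $t, "\n";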
  • I need to process a large amount of data. Any help is welcomed :) Commented Mar 2, 2018 at 15:14
  • So, what do you want to do to this large amount of data, and why must it be done very quickly? Perhaps it could motivate us to help you, if we knew what you were trying to achieve? In other words: How can we optimize something we know nothing about? Commented Mar 2, 2018 at 15:18
  • Note that the two patterns don't match the same things: the one without the leading \b will happily match atbds #, while the other one will not. Commented Mar 2, 2018 at 15:19
  • You might try another approach: $res = preg_grep('/\btbds\b/i', explode("#", $data)); Commented Mar 2, 2018 at 19:55
  • @Miguel perhaps you could help us to understand your real input data because what you are generating randomly will never find a match -- there are no # symbols generated. If you want us to design an optimized pattern for you, we need to fully understand the input variability. Furthermore, you are performing unicode matching, but there are no unicode characters on offer. Please improve your question. Commented Mar 5, 2018 at 3:09
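
The preg_grep suggestion from the comments can be sketched roughly as follows. The sample string is made up for illustration, and this has not been benchmarked against the question's data. It also is not strictly equivalent to the original pattern, which requires at least one character on each side of tbds inside a record.

```php
<?php
// Hypothetical sample data: records terminated by '#', as in the question.
$data = " foo tbds bar # atbds baz # TBDS qux # nothing here #";

// Split on the delimiter first, then filter the pieces with a simple pattern.
$records = explode('#', $data);
$matches = preg_grep('/\btbds\b/i', $records);

// preg_grep preserves the original array keys, so trim and reindex the result.
$matches = array_values(array_map('trim', $matches));

print_r($matches);
```

The record containing atbds is filtered out because there is no word boundary between a and t.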

2 Answers


What makes your regex slow is the preceding [^#]+

Maybe it helps if you define a starting point, which can be either # or the start of the string, like this:

/(?:(?<=#)|^)([^#]*\btbds\b.+?)#/iu

The Demo
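
As a minimal sketch of how this pattern behaves with preg_match_all, assuming a small made-up input string (the sample data is hypothetical and says nothing about timing on the full data set):

```php
<?php
// Hypothetical sample data; records are terminated by '#'.
$data = " foo tbds bar # atbds baz # x TBDS qux #";

// Pattern from the answer: a match may only begin at the start of the
// string or immediately after a '#'.
$pattern = '/(?:(?<=#)|^)([^#]*\btbds\b.+?)#/iu';

preg_match_all($pattern, $data, $matches);

// $matches[1] holds the captured records without the trailing '#'.
$records = array_map('trim', $matches[1]);

print_r($records);
```

The anchoring means the engine does not retry the same record from every character offset, which is presumably why the step count on regex101 drops.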


3 Comments

I just tested your answer and it's taking around 1.20 seconds.
I tested it at regex101, where it required about 1/10 of the steps, and with your benchmark on a PHP testing site, where it also seemed considerably faster :)
Hi frosti, your regex /(?:(?<=#)|^)([^#]*\btbds\b.+?)#/iu performs similarly to my original regex /([^#]+\btbds\b.+?)#/iu. I'm trying to get performance similar to /([^#]+tbds\b.+?)#/iu, around 0.19 seconds (6 times faster).

One possibility is to match # and then use \K to reset the starting point of the reported match.

Then match one or more characters that are not #, using [^#]+, followed by tbds between word boundaries, \btbds\b:

#\K[^#]+\btbds\b[^#]+#
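
A rough sketch of this pattern in use, assuming / delimiters and the i flag (neither is given in the answer) and a made-up input string:

```php
<?php
// Hypothetical sample data; note the leading '#', which this pattern requires.
$data = "# foo tbds bar # atbds baz # x TBDS qux #";

// Pattern from the answer, wrapped in delimiters with the /i flag (an assumption).
$pattern = '/#\K[^#]+\btbds\b[^#]+#/i';

preg_match_all($pattern, $data, $matches);

// \K drops the leading '#' from the reported match;
// the trailing '#' is still part of it.
print_r($matches[0]);
```

Each match consumes its trailing #, so a record that directly follows a matched record has no # left to start from and would be skipped. In this sample the non-matching atbds record sits between the two hits, so both are found.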

3 Comments

Thanks for the answer. I just tested your suggestion and it seems to perform similarly to the original /([^#]+\btbds\b.+?)#/iu
@Miguel Did you compare the number of steps on regex101 for a few examples?
Yes, it takes very few steps compared to the original, but when running the test code I get a similar execution time of around 1.20 seconds.
