1

I have a loop, which takes a large amount of text in each iteration and replaces specific placeholder ('token') with some other content like so:

$string = $pageContent;
foreach($categories as $row) {
    $images = $mdlGallery->getByCategory($row['id']);
    if (!empty($images)) {
        $plug = Plugin::get('includes/gallery', array('rows' => $images));
        $string = str_replace($row['token'], $plug, $string);
    }
}

The Plugin class and it's get() method simply takes the right file from a specific directory and outputs buffer as a string.

There might be a large number of categories therefore I wonder whether whether it would be better to first check the input string for an occurrence of the specific 'token' before going through populating all images from a given category using strpos() function like so:

foreach($categories as $row) {
    if (strpos($string, $row['token']) !== false) {
        $images = $mdlGallery->getByCategory($row['id']);
        if (!empty($images)) {
            $plug = Plugin::get('includes/gallery', array('rows' => $images));
            $string = str_replace($row['token'], $plug, $string);
        }
    }
}

My concern is the performance - would this help? - consider $string to potentially contain a large number of characters (TEXT field type in MySQL)?

6
  • Do any of the methods inside the foreach connect to the DB? Or read entire files into memory (file_get_contents)? Commented Jul 17, 2012 at 10:49
  • As with what @MihaiStancu said. It's the retrieval of this data that will cost performance. Is $mdlGallery->getByCategory($row['id']); a database call? Commented Jul 17, 2012 at 10:53
  • yes - getByCategory() method fetches records from the database and Plugin::get() method includes the file inside of the output buffer and returns the result as string. Commented Jul 17, 2012 at 10:54
  • Well... not one, but two costly operations, no wonder it's slow. You should (at least) be fetching the data from the DB with one single call, using multiple connect->query->read->close sequences in the DB calls makes no sense for a bulk(ish) operation. Commented Jul 17, 2012 at 11:00
  • Connection isn't closing after each statement - I'm using persistent connection (PDO::ATTR_PERSISTENT). I guess in that case it makes sense to use the strpos() first. Commented Jul 17, 2012 at 11:05

3 Answers 3

4

To solve your problem

As per your example code it seems that the files used in Plugin::get() are small in size which means including them or reading them should not incur large performance costs, but if there are a lot of them you may need to consider those costs due to OS queuing mechanisms even if the data they contain is not big.

The getByCategory method should incur large performance costs because it implies many connect->query->read->close communication sequences to the database and each implies the transfer of a large amount of data (the TEXT fields you mentioned).

You should consider fetching the data as a batch operation with one single SQL query and storing it in a cache variable indexed by the row id so that getByCategory can fetch it from the cache.

Your current problem is not a matter of simple code review, it's a matter of approach. You have used a typical technique for small datasets as an approach to handling large datasets. The notion of "wrap a foreach over the simple script" works if you have medium datasets and don't feel a performance decay, if you don't you need a separate approach to handle the large dataset.

To answer your question

Using strpos means running through the entire haystack once to check if it contains the needle, and after that running through it again to do the replace with str_replace.

If the haystack does not contain the needle, strpos === str_replace (in the matter of computational complexity) because both of them have to run through the entire string to the end to make sure no needles are there.

Using both functions adds 100% more computational complexity for any haystack that does not contain the needle and increases the computational complexity anywhere from 1% to 100% more computational complexity for any haystack that does contain the needle because strpos will return right after the first needle found which can be found at the start of the string, the middle or the end.

In short don't use strpos it does not help you here, if you were using preg_replace the RegEx engine might have incurred more computational complexity than strpos for haystacks that do not contain the needle.

Sign up to request clarification or add additional context in comments.

3 Comments

To add.. You either need to request all the data that may be needed before the loop OR If the table has too much data the hold in memory you need to find out exactly what you need to retrieve first, and then use SQL to get your data
Thanks Mihai. The getByCategory() fetches image records assigned to the currently iterated category - the large text is the $string, which might contain placeholders representing category. We are looping through each category which has a token index that might be found in the $string. If found then I call the getByCategory() method to fetch all images assigned to this category - this is obviously if I use strops() function to first check for the occurance of the token - otherwise it would be executed anyway.
I've update my answer to explain why strpos in your particular case is of no use, on the contrary it'll add to complexity.
0

Thanks Mihai - that makes a lot of sense, however - in this particular scenario even if I get all of the records from the database first - meaning all of the images with associated categories - it would be rare that the $string would contain more than just one or two 'tokens' - meaning that using strpos() could actually save time if there were many categories ('tokens') to compare against.

Imagine we don't call the getByCategory in each iteration because we already store all possible records in earlier generated array - we still have to go through output buffering inside of the Plugin::get() method and str_replace() - meaning that if we have say 20 categories - this would occur 20 times without necessarily 'token' being included within the $string.

So your suggestion would work if there was suppose to be a large number of 'tokens' found in the $string comparing to the number of categories we are looping through, but for a small number of 'tokens' I think that strpos() would still be beneficial as that would be the only one executed for each category rather then two following when the strpos() returns true - in which case it's a small price to pay in the form of strpos() comparing to ob and str_replace together each time in the loop - don't you think?

I very much appreciate your explanation though.

8 Comments

Imagine the following two operations strpos('c', 'a b d e f g h i'), and str_replace('c', 'x', 'a b d e f g h i'), they both do the exact same thing, loop through all 15 characters in the string looking for (and not finding the letter 'c'). So both internally use a for loop that iterates 15 times and compares equality.
Now imagine the following two operations on another string strpos('c', 'a b c d e f g h c i'), and str_replace('c', 'x', 'a b c d e f g h c i'), strpos runs and internally uses a for that iterates 5 times until it finds the first occurrence of the letter 'c' then it allows str_replace to run which internally uses a for that iterates 15 times until it manages to replace the two occurrences of the letter 'c'.
In the first example you need to run through the string once with strpos which means 15 iterations in a for loop. But when you find out there is no needle you don't need to call str_replace. If you would have run str_replace instead you would have had the exact same result, it would have run through the string for a total of 15 iterations and concluded that there's nothing to replace.
In the second example you run strpos stop after which runs through the string and finds an occurrence at position 5 (so you've used 5 iterations until now) and returns, then you run str_replace which starts from the beginning too and runs through the string for 15 iterations to execute the replacements, a total of 5+15 iterations. If you would ahve run str_replace instead you would have had only the 15 iterations necessary to do the replacements.
Since in the first scenario 15steps == 15steps there is no upside or downside to using either, in the second scenario 15steps < 20steps so strpos before str_replace adds 5steps in complexity. If the first occurrence was at the very end of the string strpos would have added 15steps in complexity (literally doubled it). So my opinion remains that you should not use strpos unless you need to do other stuff alongside str_replace only when the string contains the tokens. If you're using strpos to check for token presence str_replace already does checks.
|
0

I think it's better to benchmark stuff by yourself if you are looking for optimization (especially for micro-optimization). Any implementation has more that one variation (usually) so it's better to benchmark your used variation. According to this you can see the benchmark results here: with strpos: http://3v4l.org/pb4hY#v533 without strpos: http://3v4l.org/v35gT

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.