
I am looking for help writing an efficient PHP algorithm to find occurrences of one string within another. Here is the current situation.

I have two arrays. The first array holds the text that needs to be searched (the haystack). The second array holds the terms to find (the needles).

I know that my first array contains at least one of the terms from the needles. So the algorithm needs to ask: 'Is array2[0] found inside array1[0]? If not, loop: is array2[1] found inside array1[0]?' and so on. Once a term is found, exit the inner loop, advance to array1[1], and repeat the process.

I want to make sure this is efficient, as I have tens of thousands of entries to process and my needle array has 1100 individual needles.

2 Comments

  • You're probably looking for the Boyer-Moore algorithm or one of its variants – they have approximately O(N) complexity. The original lets you cache a preprocessing step which could save you some time if you reuse the same needles a lot. Commented Feb 15, 2012 at 1:48
  • (johannburkard.de/software/stringsearch has a bunch of decent implementations of the algorithms you could try and port into PHP, or search for an existing one.) Commented Feb 15, 2012 at 1:52
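
To illustrate the cached preprocessing step mentioned in the comments, here is a minimal sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore. The function names buildShiftTable and horspoolSearch are hypothetical, not part of PHP or of the linked library. A pure-PHP loop like this may well be slower than the native strpos, so treat it as an illustration of the idea rather than a recommended replacement.

```php
<?php
// Hypothetical helper: precompute the Horspool shift table for one needle.
// The table can be cached and reused for every haystack entry.
function buildShiftTable(string $needle): array
{
    $m = strlen($needle);
    $table = [];
    // For every byte except the last, store its distance from the needle's end.
    for ($i = 0; $i < $m - 1; $i++) {
        $table[$needle[$i]] = $m - 1 - $i;
    }
    return $table;
}

// Hypothetical helper: offset of the first occurrence of $needle, or false.
// Assumption: an empty needle is treated as "not found".
function horspoolSearch(string $haystack, string $needle, array $table)
{
    $n = strlen($haystack);
    $m = strlen($needle);
    if ($m === 0 || $m > $n) {
        return false;
    }
    $pos = 0;
    while ($pos <= $n - $m) {
        // Compare the current window against the needle from right to left.
        $j = $m - 1;
        while ($j >= 0 && $haystack[$pos + $j] === $needle[$j]) {
            $j--;
        }
        if ($j < 0) {
            return $pos;
        }
        // Shift by the table entry for the byte under the needle's last position;
        // bytes that are not in the table allow a full-length shift.
        $last = $haystack[$pos + $m - 1];
        $pos += $table[$last] ?? $m;
    }
    return false;
}

// Usage: build the table once per needle, then reuse it across haystack entries.
$table = buildShiftTable("ipsum");
$offset = horspoolSearch("Lorem ipsum dolor", "ipsum", $table);
```

With 1100 needles, the tables could be built once up front and kept in an array keyed by needle.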

2 Answers


OK, let's start with this algorithm. It might not be the fastest, but the result is what you want: keep looping until you find the first match.

<?php
// Build 1000 haystack entries and 1000 needles that never match (worst case).
$haystack = array();
$needle = array();
for ($i = 0; $i < 1000; $i++) {
    $haystack[] = "Lorem ipsum dolor";
    $needle[] = "no match";
}
// $haystack = array("Lorem ipsum dolor", "Quisque placerat", "Cras quis porttitor orci");
// $needle = array("quis", "Lorem");
$timestamp1 = microtime(true); // current Unix time in seconds, as a float
foreach ($haystack as $word) {
    foreach ($needle as $pattern) {
        if (strpos($word, $pattern) !== false) {
            // Found a match: report it and stop checking this haystack entry
            print "'".$pattern."' is in '".$word."'<br />";
            break;
        }
        // Otherwise keep looping over the remaining needles
    }
}

$timestamp2 = microtime(true);
print "It took me ".($timestamp2 - $timestamp1)." seconds to realize there was no match";

?>

//EDIT: I commented out the hard-coded arrays, now create them dynamically, and added a timer. It takes about 1 second at most if there is no match.


1 Comment

Johannes, thanks for that. Your script uses strpos while mine was using stristr. Switching functions made my script perform a LOT better.
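
For anyone making the same switch, here is a small sketch of how the functions differ, assuming the function meant in the comment is PHP's stristr. The two are not drop-in replacements for each other: stristr ignores case and returns part of the string, while strpos is case-sensitive and returns only an offset. The sample strings are illustrative.

```php
<?php
$haystack = "Lorem ipsum dolor";

// stristr() is case-insensitive and returns the haystack from the match onward (or false).
$tail = stristr($haystack, "IPSUM");   // "ipsum dolor"

// strpos() is case-sensitive and returns only the integer offset (or false).
$exact = strpos($haystack, "IPSUM");   // false, because the case differs

// stripos() is the case-insensitive counterpart of strpos().
$offset = stripos($haystack, "IPSUM"); // 6
```

If the needles must match regardless of case, stripos keeps that behaviour without building a substring for every hit.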
1

A trie built from the haystack, with extra information recorded for each word such as its position (page, line, and word number), is more efficient. It uses a divide-and-conquer strategy to avoid useless lookups. With a loop strategy, every item in the haystack has to be searched. A trie sorts the haystack so that you can skip parts of it. Here is an example in PHP: http://phpir.com/tries-and-wildcards
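
As a rough illustration of the idea (not the code from the linked article), here is a minimal sketch of a trie that indexes every word of the haystack with its position. The helpers trieInsert and trieLookup and the array layout are assumptions made for this example. It matches whole words only, so unlike strpos it will not find a needle in the middle of a word.

```php
<?php
// Hypothetical helper: insert $word into $trie and record where it occurred.
function trieInsert(array &$trie, string $word, array $position): void
{
    $node = &$trie;
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $ch = $word[$i];
        if (!isset($node['children'][$ch])) {
            $node['children'][$ch] = ['children' => [], 'positions' => []];
        }
        // Descend one level; $node is rebound to the child node.
        $node = &$node['children'][$ch];
    }
    $node['positions'][] = $position;
}

// Hypothetical helper: return all recorded positions of $word, or an empty array.
function trieLookup(array $trie, string $word): array
{
    $node = $trie;
    for ($i = 0, $len = strlen($word); $i < $len; $i++) {
        $ch = $word[$i];
        if (!isset($node['children'][$ch])) {
            return []; // no indexed word follows this path, so stop early
        }
        $node = $node['children'][$ch];
    }
    return $node['positions'];
}

// Build the index once: each position is [haystack entry index, word number].
$haystack = ["Lorem ipsum dolor", "Quisque placerat", "Cras quis porttitor orci"];
$trie = ['children' => [], 'positions' => []];
foreach ($haystack as $entry => $text) {
    $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $number => $word) {
        trieInsert($trie, $word, [$entry, $number]);
    }
}

// Each needle lookup now costs at most strlen($needle) steps,
// regardless of how many haystack entries were indexed.
$hits = trieLookup($trie, "quis");
```

Here the words are lowercased before indexing, so needles should be lowercased before lookup as well.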
