I'm writing a search engine for my site and need to extract chunks of text that contain a given keyword plus a few words around it, for display in the search result list. I ended up with something like this:


/**
 * Returns parts of the original text that contain
 * the searched term and a few words around it.
 * @param string $text Original text
 * @param string $word Searched term
 * @param int $maxChunks Number of chunks returned
 * @param int $wordsAround Number of words before and after the searched term
 */
public static function searchTerm($text, $word=null, $maxChunks=3, $wordsAround=3) {
    $word = trim($word);
    if(empty($word)) {
        return NULL;
    }
    $words = explode(' ', $word); // extract single words from searched phrase
    $text  = strip_tags($text);  // clean up the text
    $whack = array(); // chunk buffer
    $cycle = 0; // successful matches counter
    foreach($words as $word) {
        $match = array();
        // there are named parameters 'pre', 'term' and 'pos'
        if(preg_match("/(?P<pre>\w+){0,$wordsAround} (?P<term>$word) (?P<pos>\w+){0,$wordsAround}/", $text, $match)) {
            $cycle++;
            $whack[] = $match['pre'] . ' ' . $word . ' ' . $match['pos'];
            if($cycle == $maxChunks) break;
        }
    }
    return implode(' | ', $whack);
}

This function does not work, but you can see the basic idea. Any suggestions on how to improve the regular expression are welcome!

  • Why do you split the string if you want several words around the term? Commented Oct 8, 2010 at 12:08
  • The whole construction looks way too complicated in my opinion. Do you really need to cut the text at word boundaries? Otherwise you could simply use PHP's substr() function. Putting plain variables into regular expressions is a bit problematic, too. Take a look at preg_quote(), or use strpos(). Commented Oct 8, 2010 at 12:19
  • In this line: if($cycle == $maxCycles) continue; you use the variable $maxCycles. I think you would actually want to put $maxChunks there, don't you? Commented Oct 8, 2010 at 12:58
  • @MatTheCat - I'd like to search for every word in the phrase, not for the exact phrase. @elusive - Yes. It wouldn't look good if the words were cut off. @slosd - You are right. Commented Oct 8, 2010 at 13:22
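
The comments above suggest matching with strpos() instead of a regular expression while still keeping whole words. A minimal sketch of that idea follows; the function name excerptChunks and its exact behavior are assumptions made for illustration, not code from the thread. It splits the text into words, finds each word that contains one of the search terms (case-insensitively, via stripos()), and returns that word with its neighbors.

```php
<?php
// Hypothetical helper, not from the original post: finds search terms without
// putting user input into a regular expression.
function excerptChunks(string $text, string $phrase, int $maxChunks = 3, int $wordsAround = 3): string
{
    // Split the phrase and the tag-free text on whitespace, dropping empty pieces.
    $terms = preg_split('/\s+/', trim($phrase), -1, PREG_SPLIT_NO_EMPTY);
    $words = preg_split('/\s+/', strip_tags($text), -1, PREG_SPLIT_NO_EMPTY);

    $chunks = [];
    foreach ($words as $i => $w) {
        foreach ($terms as $term) {
            // Case-insensitive substring test; no regex metacharacters involved.
            if (stripos($w, $term) !== false) {
                $start  = max(0, $i - $wordsAround);
                $length = ($i - $start) + 1 + $wordsAround; // words before + hit + words after
                $chunks[] = implode(' ', array_slice($words, $start, $length));
                break; // one chunk per matching word
            }
        }
        if (count($chunks) >= $maxChunks) {
            break;
        }
    }
    return implode(' | ', $chunks);
}
```

Note that this sketch matches substrings, so "cat" would also hit "concatenate", and neighboring hits can produce overlapping chunks. Whether that is acceptable depends on the search behavior you want.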

2 Answers

Never, never inject user content into the pattern of a RegEx without using preg_quote to sanitize the input:

https://www.php.net/manual/en/function.preg-quote.php
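
As a rough illustration of that advice, here is a sketch that escapes the term with preg_quote() before building the pattern. The function name excerptAround and the pattern itself are assumptions for this example, not the asker's code: the pattern allows up to $wordsAround whitespace-separated words on each side of the term and matches case-insensitively.

```php
<?php
// Hypothetical helper, not from the original post: escapes the search term
// with preg_quote() before it becomes part of the pattern.
function excerptAround(string $text, string $term, int $wordsAround = 3): ?string
{
    // Escape regex metacharacters in the term; '/' is passed because it is the delimiter.
    $quoted = preg_quote($term, '/');

    // Up to $wordsAround words before and after the word that contains the term.
    $pattern = '/(?:\S+\s+){0,' . $wordsAround . '}\S*' . $quoted
             . '\S*(?:\s+\S+){0,' . $wordsAround . '}/iu';

    if (preg_match($pattern, $text, $match)) {
        return $match[0]; // the whole excerpt
    }
    return null; // no match (or the pattern could not be applied)
}

// Usage: a term such as "c++" contains regex metacharacters and is matched literally.
$excerpt = excerptAround('I like c++ and php a lot', 'c++', 2);
```

The sketch returns only the first match. Collecting several chunks would need preg_match_all() or a loop over the individual words of the phrase.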

3 Comments

OK, that's one suggestion, but as long as the regular expression does not work, this is not critical. Thanks anyway, I'll put the preg_quote in.
Are you trying to optimize the RegEx or fix it?
I'm no friend of regular expressions, so this was my first idea, but I wasn't able to move on and make it work the right way.
Why reinvent the wheel here? Doesn't Google have the best search engine? I would look at their appliance.

1 Comment

I know they have it, and I like the way they do it. But I was hoping to solve the problem with one lightweight function, not a whole third-party search engine.
