Implementing Basic Efficient "Search" Algorithm: Python

Question

Mock Search Algorithm

Here is my problem:

In a given document(string) such as:

"hello there my name is dominic and my name is very special"

and searchTerms(list) such as:

['my','dominic'] or ['dominic','my'] (shouldn't matter)

an algorithm would return the shortest excerpt of the document containing the terms:

>>> 'dominic and my'

because

>>> 'my name is dominic'

contains more words than the previous.

Here is my current idea:

Start the algorithm off by creating a list with a number of inner list elements equal to the number of search terms. Inside each number of list would be indices in which that search element appears in the document.

document = document.split();

So searchTerms = ['my', 'dominic'] would return

                 [[2,7], [5]]

because my appears at index 2 and 7 and dominic appears only at 5.

The algorithm would then take this list and generate a list of all posibilities:

                 [[2,5], [7,5]]

As you can see there are two substrings in the document string that contain both my and dominic. Then the algo could take the range of both inner lists my doing a max()-min() This would tell me the second exerpt of document is smaller than the first and could then return document[5:(7+1)] which would be the expected outcome.

For my idea this is what I have so far:

document = "hello there my name is dominic and my name is very special"
searchTerms = ['my', 'dominic']
def answer(document, searchTerms):
    index = []
    document = document.split()
    for a in range(0, len(searchTerms)):
        index.append([i for i, x in enumerate(document) if x == searchTerms[a]])
    return index

As of now this returns [[2,7],[5]] however I am running into mainly one issue:

- Efficiency:

Is this solution efficient for a document string and search terms list that is extremely large? And if not what could be done to make it more efficient or is my original idea not good

I appreciate any insight into solving this issue, thank you.

Your time complexity is O(n^2) in the number of words in your document. Not horrible, but not great either... What about creating a dict that tells you all positions containing a word ({"word": [36, 42, 51], ...}). Then you just need to find the closest positions for two words.. — thebjorn
– thebjorn, Commented Jan 9, 2015 at 21:52
Would this be just as easy to implement if there were say twenty search terms? — Dominic Farolino
– Dominic Farolino, Commented Jan 9, 2015 at 21:58
You're probably going to find interesting problems if you have 20 search terms, but it shouldn't be any harder to implement. You'll need to figure which combination of positions for word_1, word_2, ... will give the smallest distance. If you calculate all possibilities that would be O(m^m) where m is the average number of times a word occurs in the document. That's a really bad time complexity, but you can probably come up with smarter solutions ;-) — thebjorn
– thebjorn, Commented Jan 9, 2015 at 22:07

Jarlax · Accepted Answer · 2015-01-10 14:02:34Z

Your algorithm can perform really slow on mediocre inputs. Imagine that you have 10 search terms and text consisting of 10000 words. In this case it's possible that for each term you will have list of 1000 indexes. This will end with generation of 1000^10 total possibilities.

In terms of big-O notation complexity is O((n/k)^k) where n is number of terms in text, k - number of search terms.

Here is idea faster algorithm. While iterating document word by word we need to keep track of closest search term indexes to the current position. Let's call this structure lookup (simple python's dict). Quick example:

"hello there my name is dominic and >my< name is very special"

Suppose we are about to visit "my" word which is highlighted. At this moment lookup is {"my": 2, "dominic": 5}. Current "my" will be closer to any further word in text. So when visiting next word ("name") we will have updated version {"my": 7, "dominic": 5}. It's easy to see that optimal solution corresponds to one of the states of lookup. So to get an answer just keep track of max()-min() of values in the dictionary. Note: you should start keeping track only when all search terms will be present as keys of lookup.

At each occurrence of search term we need to iterate k values from positions lookup, so complexity of this algorithm is O(nk).

To make it even better, you can additionally use balanced BST with indexes from lookup. Now instead of iterating lookup values (O(k)) you can retrieve min index in O(logk):

min_index = bst.min()
old_index=lookup[curr_term] # O(1)
bst.delete(old_index) # O(logk)
bst.insert(new_index) # O(logk)

In this case total complexity will be O(nlogk).

Edit. The code without tree optimization (haven't found built-in BST in Python):

document = "hello there my name is dominic and my name is very special"
searchTerms = { 'my', 'dominic' } # set has faster lookups

doc_words = document.split()

from sys import maxint
def search(words, terms):
    found_terms = [[i,x] for i,x in enumerate(words) if x in terms]
    lookup = {}
    cnt = maxint
    k = len(terms)
    start,end=-1,-1
    for i,w in found_terms:
        lookup[w] = i 
        if k == len(lookup):
            min_idx = min(lookup.values())
            curr = i - min_idx
            if curr < cnt:
                cnt,start,end = curr,min_idx,i
    return words[start:end + 1]

Collectives™ on Stack Overflow

Implementing Basic Efficient "Search" Algorithm: Python

Mock Search Algorithm

Here is my problem:

Here is my current idea:

- Efficiency:

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Mock Search Algorithm

Here is my problem:

Here is my current idea:

- Efficiency:

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related