How to find a substring using partial matching [closed]

Question

Closed. This question needs details or clarity. It is not currently accepting answers.

Closed 12 years ago.

bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
smallString = "I_AM_HERE"

Which efficient algorithm should I use to find the substring of bigString that most closely matches smallString?

output = "I_AM_THERE"

The output may contain a few insertions and deletions compared with smallString.

Edit: Found a good example, very close to my problem here: How to add variable error to regex fuzzy search. Python

When you say "matches closely", do you mean an exact match only or a fuzzy match?
– dawg, Commented Nov 12, 2013 at 21:12
@jazzpi: Great minds not only think alike, but at the same time...
– Scott Hunter, Commented Nov 12, 2013 at 21:21

zelusp · Accepted Answer · 2016-04-24 04:02:07Z

8

You can use the almost-ready-to-be-everyone's-regex package, which supports fuzzy matching:

>>> import regex
>>> bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
>>> regex.search('(?:I_AM_HERE){e<=1}',bigString).group(0)
'I_AM_THERE'

Or:

>>> bigString = "AGAH_I_AM_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_NOWHERE_EREXXMHHGRFS"
>>> print(regex.findall('I_AM_(?:HERE){e<=3}',bigString))
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']

The regex module is a third-party package. It was hoped at the time that it would become part of Python 3.4, but it has not been added to the standard library, so install it with pip install regex or pip3 install regex.

Answer to the comment "Is there a way to know the best out of the three in your second example? How to use BESTMATCH flag here?":

Either use the best match flag (?b) to get the single best match:

print(regex.search(r'(?b)I_AM_(?:ERE){e<=3}', bigString).group(0))
# I_AM_THE

Or collect all acceptable matches that start with the literal I_AM_ and rank them, either with difflib or by their Levenshtein distance to the target string:

import regex

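def levenshtein(s1, s2):
    # Keep s1 as the shorter string so that each row stays small.
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances[i] is the edit distance between s1[:i] and the part of s2 seen so far.
    distances = list(range(len(s1) + 1))
    for index2, char2 in enumerate(s2):
        newDistances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                newDistances.append(distances[index1])
            else:
                # 1 + the cheapest of substitution, deletion, insertion
                newDistances.append(1 + min(distances[index1],
                                            distances[index1 + 1],
                                            newDistances[-1]))
        distances = newDistances
    return distances[-1]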

bigString = "AGAH_I_AM_NOWHERE_HERE_RGHFKXXX_I_AM_THERE_XXX_I_AM_HERE_EREXXMHHGRFS"
cl=[(levenshtein(s,'I_AM_HERE'),s) for s in regex.findall('I_AM_(?:HERE){e<=3}',bigString)]

print(cl)
print([t[1] for t in sorted(cl, key=lambda t: t[0])])

print(regex.search(r'(?e)I_AM_(?:ERE){e<=3}', bigString).group(0))

Prints:

[(3, 'I_AM_NOWHERE'), (1, 'I_AM_THERE'), (0, 'I_AM_HERE')]
['I_AM_HERE', 'I_AM_THERE', 'I_AM_NOWHERE']
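If installing regex is not an option, the Levenshtein idea can also be applied directly to the original question. The sketch below is not part of the answer above; it is a pure-Python variant of the same dynamic program in which a match may start anywhere in the big string at no cost, so it returns a substring with the smallest edit distance to the pattern. The function name best_fuzzy_substring and its return shape are made up for illustration, and it runs in O(len(pattern) * len(text)) time.

```python
def best_fuzzy_substring(pattern, text):
    """Return (distance, start, end) such that text[start:end] has the
    smallest Levenshtein distance to pattern (first such match wins)."""
    m = len(pattern)
    # prev[i] = (cost, start): cheapest alignment of pattern[:i] with a
    # substring of text that ends at the previous column and begins at start.
    prev = [(i, 0) for i in range(m + 1)]
    best = (m, 0, 0)  # matching the empty substring costs m deletions
    for j, ch in enumerate(text, start=1):
        cur = [(0, j)]  # a match may begin at any position for free
        for i in range(1, m + 1):
            diag_cost, diag_start = prev[i - 1]
            if pattern[i - 1] != ch:
                diag_cost += 1
            candidates = [
                (diag_cost, diag_start),             # match or substitution
                (prev[i][0] + 1, prev[i][1]),        # extra character in text
                (cur[i - 1][0] + 1, cur[i - 1][1]),  # pattern character missing
            ]
            cur.append(min(candidates, key=lambda c: c[0]))
        if cur[m][0] < best[0]:
            best = (cur[m][0], cur[m][1], j)
        prev = cur
    return best


bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
distance, start, end = best_fuzzy_substring("I_AM_HERE", bigString)
print(distance, bigString[start:end])
# 1 I_AM_THERE
```

Unlike the regex version, this needs no error threshold up front: it reports the best distance it found, and the caller can decide whether that is close enough.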

edited Apr 24, 2016 at 4:02

zelusp

answered Nov 13, 2013 at 0:44

dawg

5 Comments

user1140126 Over a year ago

@dawg, Thanks. Is there a way to know the best out of the three in your second example? How to use BESTMATCH flag here?

zelusp Over a year ago

Hi @dawg - this is a great answer. I've edited it to show how to use the under-documented BESTMATCH flag (?b) instead of the similar but different ENHANCEMATCH flag (?e). Cheers!

zelusp Over a year ago

... Here's a simple demo of (?b) in action.

data_steve Over a year ago

@dawg where is the documentation explaining the regex usage? The PyPI page and homepage for regex do not give working examples of this {e<=3} type of error/fuzzy syntax. I'm trying to find all match candidates above a threshold percentage match. I want the matches and their percentages. So far the examples I've found only provide the candidates above a threshold, but not the percentage for each candidate.

dawg Over a year ago

The documentation for the fuzzy syntax is on the PyPI documentation page for the regex module.

beroe · Accepted Answer · 2013-11-12 22:08:14Z

0

Here is a bit of a hacky way to do it with difflib:

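from difflib import get_close_matches

bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
smallString = "I_AM_HERE"

window = len(smallString) + 1  # allow for matches one character longer
# every window-sized slice of bigString, including the final one
chunks = [bigString[i:i+window] for i in range(len(bigString) - window + 1)]
print(get_close_matches(smallString, chunks, 1))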

Output:

['I_AM_THERE']
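If you also want a similarity score for each candidate, the same sliding-window idea can be extended a little. This is only a sketch, and it assumes that difflib's SequenceMatcher.ratio() is an acceptable measure of closeness; ranked_windows, slack and top are names invented here. It tries window lengths within slack characters of the small string and returns the highest-scoring chunks together with their ratios.

```python
from difflib import SequenceMatcher


def ranked_windows(small, big, slack=1, top=3):
    """Score every window of big whose length is within slack of len(small)
    and return the top (ratio, chunk) pairs, best first."""
    matcher = SequenceMatcher(None, autojunk=False)
    matcher.set_seq2(small)  # seq2 is cached, so only seq1 changes per window
    scored = []
    for size in range(max(1, len(small) - slack), len(small) + slack + 1):
        for i in range(len(big) - size + 1):
            chunk = big[i:i + size]
            matcher.set_seq1(chunk)
            scored.append((matcher.ratio(), chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top]


bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
for ratio, chunk in ranked_windows("I_AM_HERE", bigString):
    print(f"{ratio:.3f} {chunk}")
```

The first line printed should be I_AM_THERE with a ratio of 18/19 (about 0.947). The cost grows with the number of windows, so this is only reasonable for short inputs like the one in the question.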

answered Nov 12, 2013 at 22:08

beroe

12.4k6 gold badges40 silver badges82 bronze badges

Paul · Accepted Answer · 2013-11-13 02:38:14Z

0

Maybe the dynamic programming problem Longest Common Substring would be of some use here. Depending on your needs and matching criteria, you could perhaps use Longest Common Subsequence instead.
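For reference, a minimal sketch of the Longest Common Substring dynamic program, keeping only two rows of the table (the function name is just illustrative). Note that on the strings from the question it finds only the part shared exactly, so it is probably best treated as a way to locate an anchor rather than as a full fuzzy match.

```python
def longest_common_substring(a, b):
    """Return the longest string that occurs contiguously in both a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # extend the common run that ended at a[i-2], b[j-2]
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]


bigString = "AGAHKGHKHASNHADKRGHFKXXX_I_AM_THERE_XXXXXMHHGRFSAHGSKHASGKHGKHSKGHAK"
print(longest_common_substring(bigString, "I_AM_HERE"))
# I_AM_
```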

answered Nov 13, 2013 at 2:38

Paul

7,3758 gold badges45 silver badges41 bronze badges
