Python regex problems with string matching

Question

Sorry in advance for such a long post

EDIT--

Modified from Norman's Solution to print and return if we find an exact solution, otherwise print all approximate matches. It's currently still only getting 83/85 matches for a specific example of searching for etnse on the dictionary file provided below on the third pastebin link.

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = []

    begin = time.time()

    # get all of the patterns associated with the given phrase
    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        for m in re.finditer(pattern, entireFile):
            startIndices.append((m.start(), m.end(), m.group()))
        # if the first pattern(exact match) is valid, then just print the results and we're done
        if len(startIndices) != 0 and startIndices[0][2] == origPattern:
            print("\nThere is an exact match at: [{}:{}] for {}").format(*startIndices[0])
            return

    print('Used {} patterns:').format(len(patterns))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}').format(i, p)

    # list for all non-overlapping starting indices
    nonOverlapping = []
    # hold the last matches ending position
    lastEnd = 0
    # find non-overlapping matches by comparing each matches starting index to the previous matches ending index
    # if the starting index > previous items ending index they aren't overlapping
    for start in sorted(startIndices):
        print(start)
        if start[0] >= lastEnd:
            # startIndicex[start][0] gets the ending index from the current matches tuple
            lastEnd = start[1]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:').format(len(startIndices))
    # i is the key <starting index> assigned to the value of the indices (<ending index>, <string at those indices>
    for start in sorted(startIndices):
        # *startIndices[i] means to unpack the tuple associated to the key i's value to be used by format as 2 inputs
        # for explanation, see: http://stackoverflow.com/questions/2921847/what-does-the-star-operator-mean-in-python
        print('- [{}:{}]  {}').format(*start)

    print()
    print('Found {} non-overlapping matches:').format(len(nonOverlapping))
    for ov in nonOverlapping:
        print('- [{}:{}]  {}').format(*ov)

    end = time.time()
    print(end-begin)

def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes. (replace)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # replace with a wildcard for each index
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions. (omitted)
    for i in range(len(origPattern)):
        t = origPattern[:]
        # remove a char for each index
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        # insert a wildcard between adjacent chars for each index
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

ORIGINAL: http://pastebin.com/bAXeYZcD - the actual function

http://pastebin.com/YSfD00Ju - data to use, should be 8 matches for 'ware' but only gets 6

http://pastebin.com/S9u50ig0 - data to use, should get 85 matches for 'etnse' but only gets 77

I left all of the original code in the function because I'm not sure exactly what is causing the problem.

you can search for 'Board:isFull()' on anything to get the error stated below.

examples:

assume you named the second pastebin 'someFile.txt' in a folder named files in the same directory as the .py file.

file = open('./files/someFile.txt', 'r')
doMatching(file, "ware")

OR

file = open('./files/someFile.txt', 'r')
doMatching(file, "Board:isFull()")

OR

assume you named the third pastebin 'dictionary.txt' in a folder named files in the same directory as the .py file.

file = open('./files/dictionary.txt', 'r')
doMatching(file, "etnse")

--EDIT

The functions parameters work like so:

file is the location of a file.

origPattern is a phrase.

The function is basically supposed to be a fuzzy search. It's supposed to take the pattern and search through a file to find matches that are either exact, or with a 1 character deviation. i.e.: 1 missing character, 1 extra character, 1 replaced character, or 1 character swapped with an adjacent character.

For the most part it works, But i'm running into a few problems.

First, when I try to use something like 'Board:isFull()' for origPattern I get the following:

    raise error, v # invalid expression
sre_constants.error: unbalanced parenthesis

the above is from the re library

I've tried using re.escape() but it doesn't change anything.

Second, when I try some other things like 'Fun()' it says it has a match at some index that doesn't even contain any of that; it's just a line of '*'

Third, When it does find matches it doesn't always find all of the matches. For example, there's one file I have that should find 85 matches, but it only comes up with like 77, and another with 8 but it only comes up with 6. However, they are just alphabetical so it's likely only a problem with how I do searching or something.

Any help is appreciated.

I also can't use fuzzyfinder

How do I fix it on here? when I press tab it doesn't do anything to the code — Darrel Holt
– Darrel Holt, Commented Apr 10, 2016 at 6:20
Share your code on pastebin or ideone..just for correct indentation — rock321987
– rock321987, Commented Apr 10, 2016 at 6:22

Norman · Accepted Answer · 2016-04-10 15:55:23Z

1

I found some issues in the code:

re.escape() seems to not work because its result is not assigned.
Do origPattern = re.escape(origPattern).
When pattern is correctly escaped, be mindful of not breaking the escaping when manipulating the pattern.
Example: re.escape('Fun()') yields the string Fun\(\). The two \( substrings in it must never be separated: never remove, replace, or swap a \ without the char it escapes.
Bad manipulations: Fun(\) (removal), Fu\n(\) (swap), Fun\.{0,2}\).
Good manipulations: Fun\) (removal), Fu\(n\) (swap), Fun.{0,2}\).
You find too few matches because you only try to find fuzzy matches if there are no exact matches. (See line if indices.__len__() != 0:.) You must always look for them.
The loops inserting '.{0,2}' produce one too many pattern, e.g. 'ware.{0,2}' for ware. Unless you intend that, this pattern will find wareXY which has two insertions.
The patterns with .{0,2} don't work as described; they allow one change and one insertion.
I'm not sure about the code involving difflib.Differ. I don't understand it, but I suspect there should be no break statements.
Even though you use a set to store indices, matches from different regexes may still overlap.
You don't use word boundaries (\b) in your regexes, though for natural language that would make sense.
Not a bug, but: Why do you call magic methods explicitly?
(E.g. indices.__len__() != 0 instead of len(indices) != 0.)

I rewrote your code a bit to address any issues I saw:

def doMatching(file, origPattern):
    entireFile = file.read()
    patterns = []
    startIndices = {}

    for pattern in generateFuzzyPatterns(origPattern):
        patterns.append(pattern)
        startIndices.update((m.start(), (m.end(), m.group())) for m in re.finditer(pattern, entireFile))

    print('Used {} patterns:'.format(len(patterns)))
    for i, p in enumerate(patterns, 1):
        print('- [{}]  {}'.format(i, p))

    nonOverlapping = []
    lastEnd = 0
    for start in sorted(startIndices):
        if start >= lastEnd:
            lastEnd = startIndices[start][0]
            nonOverlapping.append(start)

    print()
    print('Found {} matches:'.format(len(startIndices)))
    for i in sorted(startIndices):
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))

    print()
    print('Found {} non-overlapping matches:'.format(len(nonOverlapping)))
    for i in nonOverlapping:
        print('- [{}:{}]  {}'.format(i, *startIndices[i]))


def generateFuzzyPatterns(origPattern):
    # Escape individual symbols.
    origPattern = [re.escape(c) for c in origPattern]

    # Find exact matches.
    pattern = ''.join(origPattern)
    yield pattern

    # Find matches with changes.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = '.'
        pattern = ''.join(t)
        yield pattern

    # Find matches with deletions.
    for i in range(len(origPattern)):
        t = origPattern[:]
        t[i] = ''
        pattern = ''.join(t)
        yield pattern

    # Find matches with insertions.
    for i in range(len(origPattern) + 1):
        t = origPattern[:]
        t.insert(i, '.')
        pattern = ''.join(t)
        yield pattern

    # Find two adjacent characters being swapped.
    for i in range(len(origPattern) - 1):
        t = origPattern[:]
        if t[i] != t[i + 1]:
            t[i], t[i + 1] = t[i + 1], t[i]
            pattern = ''.join(t)
            yield pattern

answered Apr 10, 2016 at 15:55

Norman

1,9751 gold badge20 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

9 Comments

Darrel Holt Over a year ago

Thanks a lot for that, I'm going to run it right now. The yield keyword you used for finding the matches means you're using generators in the for loops, right? I had no idea what the yield keyword was until you used it in this code so I had to go look it up.

Darrel Holt Over a year ago

I just noticed it still doesn't get all 85, but it got 79 this time. I wonder if maybe the data I was given doesn't have 85 possible matches because I can't imagine this being any better than it already is.

Darrel Holt Over a year ago

On the first for loop in doMatching() where it iterates through each pattern from generateFuzzyPatterns() how does it get every pattern that generateFuzzyPatterns() 'yields'? I thought the yield function was like a return function where it would break out of the function to return. How does it not yield the 'exact matches' pattern every time it's called with the aforementioned for loop?

Darrel Holt Over a year ago

There is a problem with using a dictionary I just found out. lets say the origPattern passed is 'zenith'. it adds it when it goes over exact match, but when it matches with removals it updates the index that matched 'zenith' with 'zenth' for example. Thus throwing away the exact match. This is also why it comes up with 79 matches instead of 85, because it overwrites the previous matches. I think i'll have to make a dict of lists for it to add all possibilities from what I've read but I haven't figured it out yet.

Norman Over a year ago

You're right. A dict sounds good, with tuples of (start, end) as keys. Also, verify that insertions work correctly: the code also inserts a wildcard before and after the word, which disagrees with your comment ("insert between adjacent chars"). Also, who says there's 8 matches? It's easy to manually verify that and then contrast with the code's result.

|

Collectives™ on Stack Overflow

Python regex problems with string matching

1 Answer 1

9 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

9 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related