0

I have 2 lists. Actual and Predicted. I need to compare both lists and determine the number of fuzzy matches. The reason I say fuzzy matches is due to the fact that they will not be the exact same. I am using the SequenceMatcher from the difflib library.

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

I can assume that strings with a percentage match of above 80% are considered to be the same. Example Lists

actual=[ "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

I need a way to pick out that Apple, Peace and Orange in the predicted list has been found in the actual list. So only 3 matches have been made and not 5 matches. How do I do this efficiently?

1
  • 3
    What is the question? Commented Jun 21, 2017 at 12:32

8 Answers 8

3

You can use the following set comprehension to get the desired output using your similar method if fuzzy matching is indeed what you're looking for.

threshold = 0.8
result = {x for x in predicted for y in actual if similar(x, y) > threshold}
Sign up to request clarification or add additional context in comments.

Comments

1

You can turn both of the lists to sets and apply intersection on them.

That will give you three items {'Peace', 'Apple', 'Orange'}.

Than, you can calculate the ratio within the result set len to the actual list len.

actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

res = set(actual).intersection(predicted)

print (res)
print ((len(res) / len(actual)) * 100)

Edit:

In order to use the ratio you will need to implement nested loop. As set is implemented as a hash table so search is O(1), I would prefer to use the actual as a set.

If the predicted is in the actual (Exact match) so just add it to your result set. (best case is that all like that and final complexity is O(n)).

If the predicted is not in actual, loop though the actual and find whether a ratio over 0.8 is exist. (worst case is that all are like that, complexity (On^2))

actual={"Appl", "Orange", "Ornge", "Peace"}
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

result = {}

for pre in predicted:
    if pre in actual:
        result.add(pre)
    else:
        for act in actual:
            if (similar(pre, act) > 0.8):
                result.add(pre)

2 Comments

This approach won't take into account fuzzy matches. I have changed the list and now "Apple" will not be recognized even though Appl and Apple share an 88% match.
@JavaBeginner , Added an Edit section.
1
{x[1] for x in itertools.product(actual, predicted) if similar(*x) > 0.80}

Comments

0
>>> actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
>>> predicted=["Red", "Apple", "Green", "Peace", "Orange"]
>>> set(actual) & set(predicted)
set(['Orange', 'Peace', 'Apple'])

Comments

0

I this case you only have to check if i'th element of predicted list is present in actual list or not. if present, then add to new list.

In [2]: actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
...: predicted=["Red", "Apple", "Green", "Peace", "Orange"]


In [3]: [i for i in predicted if i in actual]
Out[3]: ['Apple', 'Peace', 'Orange']

Comments

0

Simple approach, but NOT effective, would be:

counter = 0
for item in b:
    if SequenceMatcher(None, a, item).ratio() > 0:
        counter += 1

This is what you want, the number of fuzzy matched elements, not only the same elements (as offered by most other answers).

Comments

0

First take the intersection of the two sets:

actual, predicted = set(actual), set(predicted)

exact = actual.intersection(predicted)

If this comprises all your actual words then you're done. However,

if len(exact) < len(actual):
    fuzzy = [word for word in actual-predicted for match in predicted if similar(word, match)>0.8]

Finally your resulting set is exact.union(set(fuzzy))

Comments

0

You can also try the following approach to achieve your requirement:

import itertools

fuzlist = [ "Appl", "Orange", "Ornge", "Peace"]
actlist = ["Red", "Apple", "Green", "Peace", "Orange"]
foundlist = []
for fuzname in fuzlist:
    for name in actlist:
        for actname in itertools.permutations(name):
            if fuzname.lower() in ''.join(actname).lower():
                foundlist.append(name)
                break

print set(foundlist)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.