Python - Matching strings from 2 lists

Question

I have 2 lists. Actual and Predicted. I need to compare both lists and determine the number of fuzzy matches. The reason I say fuzzy matches is due to the fact that they will not be the exact same. I am using the SequenceMatcher from the difflib library.

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

I can assume that strings with a percentage match of above 80% are considered to be the same. Example Lists

actual=[ "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

I need a way to pick out that Apple, Peace and Orange in the predicted list has been found in the actual list. So only 3 matches have been made and not 5 matches. How do I do this efficiently?

What is the question?

omri_saadon
– omri_saadon

2017-06-21 12:32:28 +00:00
Commented Jun 21, 2017 at 12:32 — omri_saadon
– omri_saadon, Commented Jun 21, 2017 at 12:32

athul.sure · Accepted Answer · 2017-06-21 12:43:39Z

3

You can use the following set comprehension to get the desired output using your similar method if fuzzy matching is indeed what you're looking for.

threshold = 0.8
result = {x for x in predicted for y in actual if similar(x, y) > threshold}

answered Jun 21, 2017 at 12:43

athul.sure

3286 silver badges14 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

omri_saadon · Accepted Answer · 2017-06-21 13:16:42Z

1

You can turn both of the lists to sets and apply intersection on them.

That will give you three items {'Peace', 'Apple', 'Orange'}.

Than, you can calculate the ratio within the result set len to the actual list len.

actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

res = set(actual).intersection(predicted)

print (res)
print ((len(res) / len(actual)) * 100)

Edit:

In order to use the ratio you will need to implement nested loop. As set is implemented as a hash table so search is O(1), I would prefer to use the actual as a set.

If the predicted is in the actual (Exact match) so just add it to your result set. (best case is that all like that and final complexity is O(n)).

If the predicted is not in actual, loop though the actual and find whether a ratio over 0.8 is exist. (worst case is that all are like that, complexity (On^2))

actual={"Appl", "Orange", "Ornge", "Peace"}
predicted=["Red", "Apple", "Green", "Peace", "Orange"]

result = {}

for pre in predicted:
    if pre in actual:
        result.add(pre)
    else:
        for act in actual:
            if (similar(pre, act) > 0.8):
                result.add(pre)

edited Jun 21, 2017 at 13:16

answered Jun 21, 2017 at 12:37

omri_saadon

10.7k8 gold badges36 silver badges58 bronze badges

2 Comments

Bryce Ramgovind Over a year ago

This approach won't take into account fuzzy matches. I have changed the list and now "Apple" will not be recognized even though Appl and Apple share an 88% match.

omri_saadon Over a year ago

@JavaBeginner , Added an Edit section.

TTT · Accepted Answer · 2017-06-21 13:50:14Z

1

{x[1] for x in itertools.product(actual, predicted) if similar(*x) > 0.80}

edited Jun 21, 2017 at 13:50

answered Jun 21, 2017 at 12:51

TTT

3174 silver badges10 bronze badges

Comments

Saket Mittal · Accepted Answer · 2017-06-21 12:36:31Z

0

>>> actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
>>> predicted=["Red", "Apple", "Green", "Peace", "Orange"]
>>> set(actual) & set(predicted)
set(['Orange', 'Peace', 'Apple'])

answered Jun 21, 2017 at 12:36

Saket Mittal

3,9123 gold badges32 silver badges49 bronze badges

Comments

Neeraj Meshram · Accepted Answer · 2017-06-21 12:46:23Z

0

I this case you only have to check if i'th element of predicted list is present in actual list or not. if present, then add to new list.

In [2]: actual=["Apple", "Appl", "Orange", "Ornge", "Peace"]
...: predicted=["Red", "Apple", "Green", "Peace", "Orange"]


In [3]: [i for i in predicted if i in actual]
Out[3]: ['Apple', 'Peace', 'Orange']

answered Jun 21, 2017 at 12:46

Neeraj Meshram

32 bronze badges

Comments

Filip Happy · Accepted Answer · 2017-06-21 12:47:30Z

0

Simple approach, but NOT effective, would be:

counter = 0
for item in b:
    if SequenceMatcher(None, a, item).ratio() > 0:
        counter += 1

This is what you want, the number of fuzzy matched elements, not only the same elements (as offered by most other answers).

answered Jun 21, 2017 at 12:47

Filip Happy

6346 silver badges19 bronze badges

Comments

rodgdor · Accepted Answer · 2017-06-21 12:52:33Z

0

First take the intersection of the two sets:

actual, predicted = set(actual), set(predicted)

exact = actual.intersection(predicted)

If this comprises all your actual words then you're done. However,

if len(exact) < len(actual):
    fuzzy = [word for word in actual-predicted for match in predicted if similar(word, match)>0.8]

Finally your resulting set is exact.union(set(fuzzy))

answered Jun 21, 2017 at 12:52

rodgdor

2,6401 gold badge22 silver badges28 bronze badges

Comments

ash · Accepted Answer · 2017-06-21 14:54:53Z

0

You can also try the following approach to achieve your requirement:

import itertools

fuzlist = [ "Appl", "Orange", "Ornge", "Peace"]
actlist = ["Red", "Apple", "Green", "Peace", "Orange"]
foundlist = []
for fuzname in fuzlist:
    for name in actlist:
        for actname in itertools.permutations(name):
            if fuzname.lower() in ''.join(actname).lower():
                foundlist.append(name)
                break

print set(foundlist)

edited Jun 21, 2017 at 14:54

answered Jun 21, 2017 at 14:16

ash

13 bronze badges

Collectives™ on Stack Overflow

Python - Matching strings from 2 lists

8 Answers 8

Comments

2 Comments

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

8 Answers 8

Comments

2 Comments

Comments

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related