2

I have a list of strings and I want to keep only the most unique strings. Here is how I have implemented this (maybe there's an issue with the loop),

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            j += 1
        i += 1
    return descriptions

Please note that the list might have around 110K items which is why I am shortening the list every iteration.

Can anyone please identify what is wrong with this current implementation?

Edit 1:

The current results are "too similar". The filter_descriptions function returned 16 items (from a list of ~110K items). When I tried the following,

SequenceMatcher(None, descriptions[0], descriptions[1]).ratio() 

The ratio was 0.99, and with SequenceMatcher(None, descriptions[1], descriptions[2]).ratio() it was around 0.98. But with SequenceMatcher(None, descriptions[0], descriptions[15]).ratio() it was around 0.65 (which is better)

I hope this helps.

2
  • I couldn't understand the problem you are facing with your first implementation? Can you explain clearly what's wrong with it? What does it mean by "the final strings are too similar"? and help us understand how is it a problem Commented Oct 23, 2018 at 6:18
  • @JohnDoe please see Edit 1 Commented Oct 23, 2018 at 6:27

3 Answers 3

2

If you invert your logic, you can escape having to modify the list in place and still reduce the number of comparisons needed. That is, start with an empty output/unique list and iterate over your descriptions seeing if you can add each one. So for the first description you can add it immediately as it cannot be similar to anything in an empty list. The second description only needs to be compared to the first as opposed to all other descriptions. Later iterations can short circuit as soon as they find a previous description with which they are similar to (and have the candidate description be discarded). ie.

import operator

def unique(items, compare=operator.eq):
    # compare is a function that returns True if its two arguments are deemed similar to 
    # each other and False otherwise.

    unique_items = []
    for item in items:
        if not any(compare(item, uniq) for uniq in unique_items):
            # any will stop as soon as compare(item, uniq) returns True
            # you could also use `if all(not compare(item, uniq) ...` if you prefer
            unique_items.append(item)

    return unique_items

Examples:

assert unique([2,3,4,5,1,2,3,3,2,1]) == [2, 3, 4, 5, 1]
# note that order is preserved

assert unique([1, 2, 0, 3, 4, 5], compare=(lambda x, y: abs(x - y) <= 1))) == [1, 3, 5]
# using a custom comparison function we can exclude items that are too similar to previous
# items. Here 2 and 0 are excluded because they are too close to 1 which was accepted
# as unique first. Change the order of 3 and 4, and then 5 would also be excluded.

With your code your comparison function would look like:

MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar

def description_cmp(candidate_desc, unique_desc):
    # use unique_desc as first arg as this keeps the argument order the same as with your filter 
    # function where the first description is the one that is retained if the two descriptions 
    # are deemed to be too similar
    similarity_ratio = SequenceMatcher(None, unique_desc, candidate_desc).ratio()
    return similarity_ratio > MAX_SIMILAR_ALLOWED

def filter_descriptions(descriptions):
    # This would be the new definition of your filter_descriptions function
    return unique(descriptions, compare=descriptions_cmp)

The number of comparisons should be exactly the same. That is, in your implementation the first element is compared to all the others, and the second element is only compared to elements that were deemed not similar to the first element and so on. In this implementation the first item is not compared to anything initially, but all other items must be compared to it to be allowed to be added to the unique list. Only items deemed not similar to the first item will be compared to the second unique item, and so on.

The unique implementation will do less copying as it only has to copy the unique list when the backing array runs out of space. Whereas, with the del statement parts of the list must be copied each time it is used (to move all subsequent items into their new correct position). This will likely have a negligible impact on performance though, as the bottleneck is probably the ratio calculation in the sequence matcher.

Sign up to request clarification or add additional context in comments.

8 Comments

Thanks for bringing a different approach to handle this @Dunes. I will benchmark both approaches and share the results here.
SequenceMatcher took ~3 mins 10 secs, Invert Logic took ~100 microseconds
Microseconds? That sounds far too quick... The CPU can execute only about 1-10 instructions per item in your input list at that speed. Makes me think there is something wrong with my logic or the test you carried out.
My own microbench mark shows the unique version to 10 times faster with a list where everything is similar and 2 times faster with a list where everything is not similar. Nothing like the 7 orders of magnitude you're reporting. Your version is destructive (modifies the input list). Did you run my version on the same list after having run your version on it?
In all the excitement I might have got the unit wrong :) Here is the output of datetime.now(). Start SequenceMatcher: 2018-10-23 16:43:37.221874 End SequenceMatcher: 2018-10-23 16:46:47.316437 Start Invert: 2018-10-23 16:46:47.316519 End Invert: 2018-10-23 16:46:47.333900
|
1

The Problem with your logic is that each time when you delete an item from the array, the index gets re-arranged and skips a string in between. Eg:

Assume that this is the array: Description : ["A","A","A","B","C"]

iterartion 1:

i=0                      -------------0
description[i]="A"
j=i+1                    -------------1
description[j]="A"
similarity_ratio>0.6
del description[j]

Now the array is re-indexed like: Description:["A","A","B","C"]. The next step is:

 j=j+1                   ------------1+1= 2

Description[2]="B"

You have skipped Description[1]="A"


To fix this : Replace

j+=1 

With

j=i+1

if deleted. Else do the normal j=j+1 iteration

3 Comments

Yeah this seems to be the issue. But I think j=i+1 only needs to be done IF an item is deleted from the list. If no item is deleted then j should only be incremented (by one), not doing so will turn it into an infinite loop.
Yes, you have to keep the" j=j+1" for looping through the array elements. Apart from that, you have to add a "j=j+i" in case you delete an item.
@MJB Please accept / upvote the answer if it was useful
0

The value of j should not change when an item from the list is deleted (since a different list item will be present on that spot in the next iteration). Doing j=i+1 restarts the iteration every time an item is deleted (which is not what is desired). The updated code now only increments j in the else condition.

def filter_descriptions(descriptions):
    MAX_SIMILAR_ALLOWED = 0.6  #40% unique and 60% similar
    i = 0
    while i < len(descriptions):
        print("Processing {}/{}...".format(i + 1, len(descriptions)))
        desc_to_evaluate = descriptions[i]
        j = i + 1
        while j < len(descriptions):
            similarity_ratio = SequenceMatcher(None, desc_to_evaluate, descriptions[j]).ratio()
            if similarity_ratio > MAX_SIMILAR_ALLOWED:
                del descriptions[j]
            else:
                j += 1
        i += 1
    return descriptions

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.