String compare in Python

Question

I'm looking for duplicate files by comparing their filenames.

However, I found that some paths returned by os.walk contain escaped characters. For example, I may get structure in the Earth\'s core.pdf for one file and structure in the Earth\xe2\x80\x99s core.pdf for another.

In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf

In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False

How do I deal with these cases?

Edit: to clarify the question in response to the comments, duplicate files can also differ in other ways, such as:

one filename containing more spaces than the other
one filename using - as a separator while the other uses :
one filename containing Japanese/Chinese words and the other composed of digits and Japanese/Chinese words ...

They are two different characters: ' is not equal to ’. You could replace one with the other, or compare only the alphanumeric characters of each string.
– kaza, Commented Oct 6, 2017 at 19:43
They aren't the same, because they use different encodings to create the same general visual appearance. Cf. this link for a similar discussion. They are different characters, as @bulbus notes. Fixing that is complicated, as it opens a can of worms about how many possible ways there are to say something that is intellectually similar, but not literally the same.
– Paul Hodges, Commented Oct 6, 2017 at 19:44
You might try boiling them down to a "dictionary" representation, stripping out all the non-alphanumerics before comparing, and writing a report.
– Paul Hodges, Commented Oct 6, 2017 at 19:47
I know ' is not ’. But the two files are the same. For some reason they were not named exactly the same. There are other situations, like one filename containing more spaces than the other, one filename separated by - while the other by :, and some filenames containing non-letter characters such as Japanese/Chinese words... These are my difficulties now.
– wsdzbm, Commented Oct 6, 2017 at 19:58
@bulbus I don't have a collection that includes all possible character pairs like this. Comparing letters and digits only may be a workaround.
– wsdzbm, Commented Oct 6, 2017 at 20:03
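
The workaround discussed in the comments (comparing only letters and digits) could be sketched roughly as follows. This is an illustration, not code from the thread: it assumes Python 3, where filenames are Unicode strings, and the helper name comparison_key is hypothetical.

```python
import unicodedata


def comparison_key(filename):
    """Hypothetical helper: reduce a filename to its letters and digits."""
    # NFKC folds compatibility variants (e.g. full-width digits) together.
    normalized = unicodedata.normalize("NFKC", filename)
    # casefold() ignores case; isalnum() drops quotes, spaces, '-', ':' and '.',
    # but keeps Japanese/Chinese characters, which count as letters.
    return "".join(ch for ch in normalized.casefold() if ch.isalnum())


a = "structure in the Earth's core.pdf"
b = "structure in the Earth\u2019s  core.pdf"  # curly apostrophe, extra space
print(comparison_key(a) == comparison_key(b))
# True
```

A key like this covers differing punctuation, spacing and separators, but it will not match names where one contains extra digits or words that the other lacks.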

Arthur Gouveia · Accepted Answer · 2017-10-06 20:15:18Z

1

Maybe you can measure the similarity of the strings instead of requiring an exact match. Requiring an exact match can be problematic because of simple things like capitalization.

I suggest the following:

from difflib import SequenceMatcher

s1 = "structure in the Earth's core.pdf"
s2 = "structure in the Earth\xe2\x80\x99s core.pdf"

# ratio() returns a similarity score between 0.0 and 1.0 (identical)
matcher = SequenceMatcher()
matcher.set_seqs(s1, s2)
print(matcher.ratio())
# 0.9411764705882353

This result shows that the similarity between the two strings is over 94%. You could define a threshold above which items are deleted, or reviewed before deletion.
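
As a rough sketch of the threshold idea, the same ratio can be computed for every pair of filenames and only the pairs above a cutoff reported for review. The function name similar_pairs, the sample filenames and the 0.9 cutoff are assumptions chosen for illustration (Python 3), not values from the answer.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Yield (name_a, name_b, ratio) for pairs at or above the threshold."""
    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            yield a, b, ratio


# Made-up sample names; in practice these would come from os.walk.
names = [
    "structure in the Earth's core.pdf",
    "structure in the Earth\u2019s core.pdf",
    "holiday photos.zip",
]
for a, b, ratio in similar_pairs(names):
    print(f"{ratio:.2f}  {a!r}  {b!r}")
```

Every pair is compared, so the work grows quadratically with the number of files. For large directories it may help to compare only within groups of candidates, for example files of the same size.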

edited Oct 6, 2017 at 20:15

answered Oct 6, 2017 at 20:02
