
I'm looking for duplicate files by comparing their filenames.

However, I found that some paths returned by os.walk contain escaped characters. For example, I may get structure in the Earth\'s core.pdf for one file and structure in the Earth\xe2\x80\x99s core.pdf for another.

In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf

In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False

How do I deal with these cases?

==== To clarify the question in response to the comments: duplicate files can also differ in other ways, for example

  • one filename containing more spaces than the other
  • one filename separated by - while the other by :
  • one filename containing Japanese/Chinese words and the other composed of digits and Japanese/Chinese words ...
  • They are two different characters: ' is not equal to ’. You can replace one with the other, or compare only the alphanumerics of a given sentence. Commented Oct 6, 2017 at 19:43
  • They aren't the same, because they are using different encoding to create the same general visual appearance. c.f. this link for a similar discussion. They are different characters, as @bulbus notes. Fixing that is complicated, as it opens a can of worms about how many possible ways there are to say something that is intellectually similar, but not literally the same. Commented Oct 6, 2017 at 19:44
  • You might try boiling them down to "dictionary" representation, stripping out all the non-alphanumerics before comparing, and writing a report. Commented Oct 6, 2017 at 19:47
  • I know ' is not ’. But the two files are the same. For some reason they were not named exactly the same. There are other situations, like one filename containing more spaces than the other, one filename separated by - while the other by :, or some filenames containing non-letter characters such as Japanese/Chinese words... These are my difficulties now. Commented Oct 6, 2017 at 19:58
  • @bulbus I don't have such a collection including all possible char pairs like this. Comparing letters and digits only may be a workaround. Commented Oct 6, 2017 at 20:03
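
The suggestion in the comments (compare only letters and digits) can be sketched roughly as follows. This is a minimal Python 3 illustration, not a complete solution: `filename_key` is a hypothetical helper name, the UTF-8 decoding of byte paths is an assumption about how the files were named, and the key also drops the dot before the extension.

```python
import re
import unicodedata

def filename_key(name):
    """Reduce a filename to a rough comparison key (hypothetical helper)."""
    # Byte paths (as in the Python 2 session above) are decoded first;
    # UTF-8 is an assumption here.
    if isinstance(name, bytes):
        name = name.decode("utf-8", errors="replace")
    # NFKC folds compatibility forms (e.g. full-width digits) together;
    # casefold() removes case differences.
    name = unicodedata.normalize("NFKC", name).casefold()
    # Drop everything that is not a letter or digit. For str patterns \W is
    # Unicode-aware, so Japanese/Chinese characters are kept.
    return re.sub(r"[\W_]+", "", name)

a = "structure in the Earth's core.pdf"
b = "structure in the Earth\u2019s core.pdf"
print(filename_key(a) == filename_key(b))  # True
```

Names that differ only in quotes, spacing, or separators such as - and : collapse to the same key. Names that differ in actual letters or digits still produce different keys.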

1 Answer


Maybe you can measure the similarity of the strings instead of requiring an exact match. Exact matching can be problematic because of simple things like capitalization.

I suggest the following:

from difflib import SequenceMatcher

s1 = "structure in the Earth\'s core.pdf"
s2 = "structure in the Earth\xe2\x80\x99s core.pdf"

matcher = SequenceMatcher()
matcher.set_seqs(s1, s2)
print(matcher.ratio())
# 0.9411764705882353

This result shows that the similarity between both strings is over 94%. You could define a threshold to delete or to review the items before deletion.
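
As a rough sketch of the threshold idea, the snippet below compares every pair of names and reports those at or above a cutoff. The helper name `similar_pairs` and the default threshold of 0.9 are assumptions chosen for illustration, and the second sample name is written as the decoded text (U+2019) rather than raw UTF-8 bytes.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similar_pairs(names, threshold=0.9):
    """Yield (ratio, a, b) for every pair of names at or above the threshold."""
    for a, b in combinations(names, 2):
        # Lowercase both sides so capitalization does not lower the score.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            yield ratio, a, b

names = [
    "structure in the Earth's core.pdf",
    "structure in the Earth\u2019s core.pdf",
    "unrelated notes.txt",
]
for ratio, a, b in similar_pairs(names):
    print(f"{ratio:.2f}  {a!r}  {b!r}")
```

Comparing all pairs is quadratic in the number of files, so for large directory trees it may be worth grouping by a cheaper key first and reviewing only the candidates that remain.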
