I'm looking for duplicate files by comparing the filenames.
However, I found that some paths returned by os.walk contain escaped characters. For example, I may get structure in the Earth\'s core.pdf for one file and structure in the Earth\xe2\x80\x99s core.pdf for another.
In [1]: print 'structure in the Earth\'s core.pdf\nstructure in the Earth\xe2\x80\x99s core.pdf'
structure in the Earth's core.pdf
structure in the Earth’s core.pdf
In [2]: 'structure in the Earth\'s core.pdf' == 'structure in the Earth\xe2\x80\x99s core.pdf'
Out[2]: False
How do I deal with these cases?
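For reference, here is a minimal sketch of what the escaped bytes in the second name stand for. It is written in Python 3 syntax, which is an assumption (the session above is Python 2), and the variable names are only illustrative. Decoding the byte string as UTF-8 shows that \xe2\x80\x99 is a single typographic apostrophe rather than the ASCII one:

```python
import unicodedata

# The second filename as raw bytes, copied from the session above.
raw = b'structure in the Earth\xe2\x80\x99s core.pdf'

# Assumption: the filesystem stores names as UTF-8.
name = raw.decode('utf-8')

# List every non-ASCII character together with its Unicode name.
odd = [(ch, unicodedata.name(ch)) for ch in name if ord(ch) > 127]
print(odd)  # [('’', 'RIGHT SINGLE QUOTATION MARK')]
```

So the comparison fails because U+2019 and the ASCII apostrophe U+0027 are different characters, even though they look alike when printed.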
====
Just to clarify the question in response to the comments: there are also other situations that produce duplicate files, such as
- one filename containing more spaces than the other
- one filename separated by - while the other by :
- one filename containing Japanese/Chinese words and the other composed of digits and Japanese/Chinese words
...
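One possible way to cover the first two situations, plus the apostrophe case, is to compare a normalized key instead of the raw name. The following is only a sketch in Python 3 syntax (an assumption), and `PUNCT_MAP`, `name_key` and `group_by_key` are hypothetical names made up for illustration. It does not handle the digits-plus-Japanese/Chinese case, which probably needs a fuzzier comparison.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical mapping of typographic punctuation to ASCII look-alikes.
PUNCT_MAP = {
    '\u2018': "'", '\u2019': "'",   # curly single quotes
    '\u201c': '"', '\u201d': '"',   # curly double quotes
    '\u2013': '-', '\u2014': '-',   # en and em dashes
}
_TRANSLATE = {ord(k): v for k, v in PUNCT_MAP.items()}

def name_key(filename):
    """Return a comparison key for a filename (a str, not bytes)."""
    # NFKC folds compatibility forms such as full-width punctuation.
    text = unicodedata.normalize('NFKC', filename)
    text = text.translate(_TRANSLATE)
    # Treat '-', ':' and '_' (with any surrounding spaces) as one separator.
    text = re.sub(r'\s*[-:_]\s*', '-', text)
    # Collapse runs of whitespace into a single space.
    text = re.sub(r'\s+', ' ', text).strip()
    return text.casefold()

def group_by_key(filenames):
    """Group names whose keys match; return only groups with duplicates."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name_key(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]

names = [
    "structure in the Earth's core.pdf",
    "structure in the Earth\u2019s core.pdf",
    "notes - draft.txt",
    "notes:  draft.txt",
    "unique.txt",
]
for group in group_by_key(names):
    print(group)
```

The key is deliberately lossy, so names that differ only in separators or spacing will be grouped together even when the files are different; checking file size or a content hash afterwards would be a reasonable second step.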
' is not equal to ’. You can replace one with the other, or compare only the alphanumerics of a given sentence.
' is not ’, but the two files are the same. For some reason they were not named exactly the same. There are other situations, such as one filename containing more spaces than the other, one filename separated by - while the other by :, and some filenames containing non-letter characters such as Japanese/Chinese words... These are my difficulties now.
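The "compare only the alphanumerics" suggestion from the comment could look roughly like this. It is a sketch in Python 3 syntax (an assumption), and `alnum_key` is a hypothetical helper name. In Python 3, `str.isalnum()` is Unicode-aware, so Japanese/Chinese characters are kept while punctuation and spaces are dropped:

```python
import unicodedata

def alnum_key(filename):
    """Keep only letters and digits (in any script), case-folded."""
    # NFKC folds compatibility forms such as full-width digits and letters.
    text = unicodedata.normalize('NFKC', filename)
    return ''.join(ch for ch in text if ch.isalnum()).casefold()

a = alnum_key("structure in the Earth's core.pdf")
b = alnum_key("structure in the Earth\u2019s core.pdf")
print(a == b)  # True
```

This is more aggressive than replacing individual characters: names such as a-1.pdf and a1.pdf produce the same key, so it may report false duplicates.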