I want to compare two files in two different network locations. The files can be several GB in size, and sometimes the two locations are separated by a slow WAN link.

I know how to generate SHA1 hashes in Python, but I have heard of a method whereby one hashes a number of file parts, as opposed to the entire file, and then compares the hashes of the parts: for example, 64 KB from the start, "middle", and end of each file. Is this a legitimate method? How can it be done?

  • What if a change occurred in a non-hashed part? Both files would have the same hash but be different. I guess this is not what you expect. Commented Mar 13, 2012 at 8:47
  • That's a very good point. I think in this case I'll be ok because the files are video and don't get modified so much as replaced with an entirely new version. Commented Mar 13, 2012 at 21:25

2 Answers

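Download only part of the file using an HTTP Range request:

import urllib2

# Request only bytes start..end (inclusive); the server must support Range requests.
req = urllib2.Request(url)
req.headers['Range'] = 'bytes=%s-%s' % (start, end)
f = urllib2.urlopen(req)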

Then you can hash the part you downloaded:

import hashlib

s = f.read()  # contains only the requested byte range
hashlib.sha1(s).hexdigest()
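To request the start, middle, and end regions mentioned in the question, the byte ranges could be computed along these lines. This is only a sketch: `sample_ranges` and `range_header` are hypothetical helper names, the 64 KB default is taken from the question, and the file size is assumed to be known already (for example from a `Content-Length` header).

```python
def sample_ranges(size, chunk=64 * 1024):
    """Return inclusive (start, end) byte ranges for the start, middle and end.

    Hypothetical helper: `size` is the total file size in bytes and is
    assumed to be known in advance.
    """
    if size <= 0:
        return []
    if size <= 3 * chunk:
        # Small file: the three regions would touch or overlap, so cover it all.
        return [(0, size - 1)]
    mid = (size - chunk) // 2
    return [
        (0, chunk - 1),            # first `chunk` bytes
        (mid, mid + chunk - 1),    # `chunk` bytes centred in the file
        (size - chunk, size - 1),  # last `chunk` bytes
    ]


def range_header(start, end):
    """Format one range as an HTTP Range header value (end is inclusive)."""
    return 'bytes=%d-%d' % (start, end)
```

Each `(start, end)` pair can then be used as the `start` and `end` values in the Range request above.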

Of course, to make sure the files are equivalent you still have to hash every part of the file.
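Hashing every part does not require holding a multi-GB file in memory. A minimal sketch, assuming the file is readable through an ordinary path on the machine doing the hashing (`sha1_of_file` and the 1 MB block size are illustrative choices, not something from the question):

```python
import hashlib


def sha1_of_file(path, block_size=1024 * 1024):
    """Return the SHA-1 hex digest of a whole file, read block by block."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read() until it returns b''.
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()
```

Running this on each side and comparing the two digests means only the short hex strings have to cross the WAN.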

Perhaps you are thinking of hash lists or hash trees, which can be used to reduce data transfer (e.g. in BitTorrent)? Unfortunately, they differ from what you remember in a couple of ways:

  • They still hash all of the file (but in pieces).
  • They are used not to reduce the network cost of constructing the hash, but to detect changes in restricted areas so that less data needs to be transferred (for example, in BitTorrent, to identify which part of a file must be downloaded).

As Sylvain Prat says above, hashing only a few parts of the file is not reliable because it will only detect changes to those parts, and not to the rest of the file.

In your case, you could calculate the hash lists locally for each copy (i.e. run the hash calculation on the machine that holds the file). Then, by comparing which hashes match and which do not, transfer across only the parts that are different (if that is what you need to do).
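A rough sketch of that idea, assuming both sides agree on a fixed block size and each runs the hashing against its own copy. `hash_list` and `differing_blocks` are made-up names for illustration, and this is a plain hash list, not the exact scheme BitTorrent uses:

```python
import hashlib


def hash_list(path, block_size=1024 * 1024):
    """Return one SHA-1 hex digest per fixed-size block of the file."""
    digests = []
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            digests.append(hashlib.sha1(block).hexdigest())
    return digests


def differing_blocks(local, remote):
    """Return indices of blocks that differ or exist on only one side."""
    longest = max(len(local), len(remote))
    return [
        i for i in range(longest)
        if i >= len(local) or i >= len(remote) or local[i] != remote[i]
    ]
```

Only the two digest lists need to be exchanged. Block `i` covers bytes `i * block_size` up to `(i + 1) * block_size - 1`, so the differing indices map directly onto byte ranges to transfer.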
