
I'd like to write a script that traverses a file tree, calculates a hash for each file, and inserts the hash into an SQL table together with the file path, so that I can then query and search for files that are identical. What would be the recommended hash function or command-line tool for creating hashes that are extremely unlikely to collide for different files? Thanks, B

3 Answers


I've been working on this problem for much too long. I'm on my third (and hopefully final) rewrite.

Generally speaking, I recommend SHA-1, because accidental collisions are effectively impossible (whereas MD5 collisions can be constructed in minutes), and SHA-1 doesn't tend to be a bottleneck when working with hard disks. If you're determined to make your program run fast on a solid-state drive, either go with MD5, or waste days and days of your time figuring out how to parallelize the operation. In any case, do not parallelize hashing until your program does everything else you need it to do.

Also, I recommend using sqlite3. When I made my program store file hashes in a PostgreSQL database, the database insertions were a real bottleneck. Granted, I could have tried using COPY (I forget if I did or not), and I'm guessing that would have been reasonably fast.

If you use sqlite3 and perform the insertions in a BEGIN/COMMIT block, you're probably looking at about 10000 insertions per second in the presence of indexes. However, what you can do with the resulting database makes it all worthwhile. I did this with about 750000 files (85 GB). The whole insert and SHA1 hash operation took less than an hour, and it created a 140MB sqlite3 file. However, my query to find duplicate files and sort them by ID takes less than 20 seconds to run.

In summary, using a database is good, but note the insertion overhead. SHA1 is safer than MD5, but takes about 2.5x as much CPU power. However, I/O tends to be the bottleneck (CPU is a close second), so using MD5 instead of SHA1 really won't save you much time.
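To make the above concrete, here's a minimal sketch of the whole pipeline in Python's standard library: walk the tree, SHA-1 each file in chunks, do all the insertions inside a single transaction (the BEGIN/COMMIT point made above), then query for duplicate hashes. The table name and schema here are illustrative, not a prescription:

```python
import hashlib
import os
import sqlite3

def sha1_of_file(path, chunk_size=1 << 16):
    """Hash a file in chunks so large files never have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def index_tree(root, db_path="hashes.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)")
    # One transaction for all inserts -- this is what keeps sqlite3 fast.
    with conn:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                             (path, sha1_of_file(path)))
    # Paths sharing a hash are the duplicate candidates.
    dupes = conn.execute(
        "SELECT hash, GROUP_CONCAT(path) FROM files "
        "GROUP BY hash HAVING COUNT(*) > 1").fetchall()
    conn.close()
    return dupes
```

Each row returned pairs a hash with a comma-separated list of the paths that share it.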


5 Comments

how far along are you with your tool? I've been looking for a simple tool that does this for ages but couldn't find anything online beyond the obvious "compare two directories" shareware tools.
My program is already capable of loading file tree information into a database and hashing files; it works fabulously. I'm currently working on the problem of replacing duplicate files with hardlinks. Note that my program will probably only work on Linux and other Unix-like systems because it's tied to the stat structure filled in by the lstat() function.
Also, it has absolutely no frontend yet; you would have to paste in the path you want to scan, and for more complicated operations, learn how to work with Haskell code.
@JoeyAdams Hey I see this is kind of old, but wondered if you had published this anywhere? I would find it really useful (As would many people) and really don't want to write one up from scratch myself. Github?
@TolMera: Thanks for your interest. I might actually come back around to it soon, as I have a bunch of data I want to deduplicate. Though I was probably overengineering this; you could start with something like find -type f -exec sha1sum {} \; > hashes.txt and analyze the resulting text file.

You can use an MD5 or SHA-1 hash:

    function process_dir($path) {
        if ($handle = opendir($path)) {
            while (false !== ($file = readdir($handle))) {
                if ($file != "." && $file != "..") {
                    if (is_dir($path . "/" . $file)) {
                        process_dir($path . "/" . $file);
                    } else {
                        // You can change md5 to sha1.
                        // You can then put the hash into your database.
                        $hash = md5(file_get_contents($path . "/" . $file));
                    }
                }
            }
            closedir($handle);
        }
    }

If you're working on Windows, change the slashes to backslashes.
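One caveat with md5(file_get_contents(...)): it reads the entire file into memory before hashing. For large files it's better to hash incrementally. A sketch of chunked hashing in Python (the 64 KiB chunk size is an arbitrary choice):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 16):
    """Hash a file incrementally so it never has to fit in memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The result is identical to hashing the whole file contents in one call, just without the memory spike.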



Here's a solution I figured out. I didn't do all of it in PHP, though it would be easy enough to do if you wanted:

    $fh = popen('find /home/admin -type f | xargs sha1sum', 'r');
    $files = array();
    while ($line = fgets($fh)) {
        // sha1sum separates the hash and the path with two spaces.
        list($hash, $file) = explode('  ', trim($line));

        $files[$hash][] = $file;
    }
    $dupes = array_filter($files, function($a) { return count($a) > 1; });

I realise I've not used databases here. How many files are you going to be indexing? Do you need to put that data into a database and then search for the dupes there?
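If you'd rather do the grouping outside PHP, the same idea can be sketched in Python. The group_by_hash helper below is hypothetical; it parses sha1sum-style output lines ("<hash>  <path>") that you'd capture from the same find | xargs sha1sum command:

```python
from collections import defaultdict

def group_by_hash(lines):
    """Group file paths by hash from sha1sum output ("<hash>  <path>" per line)."""
    files = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # sha1sum separates the hash and the path with two spaces.
        hash_, path = line.split("  ", 1)
        files[hash_].append(path)
    # Keep only hashes shared by more than one path: the duplicates.
    return {h: paths for h, paths in files.items() if len(paths) > 1}
```

Feed it the lines of a `hashes.txt` produced by `find ... | xargs sha1sum > hashes.txt` and it returns a dict mapping each duplicated hash to its list of paths.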

1 Comment

Thanks - I've written a script in the meantime that uses an sqlite DB.
