
I'd like to write a script that traverses a file tree, calculates a hash for each file, and inserts the hash into an SQL table together with the file path, so that I can then query and search for files that are identical. What would be the recommended hash function or command-line tool for creating hashes that are extremely unlikely to collide for different files? Thanks, B

3 Answers


I've been working on this problem for much too long. I'm on my third (and hopefully final) rewrite.

Generally speaking, I recommend SHA-1, because accidental collisions are effectively impossible (whereas MD5 collisions can be constructed in minutes), and SHA-1 doesn't tend to be a bottleneck when working with hard disks. If you're determined to make your program run fast on a solid-state drive, either go with MD5, or waste days and days of your time figuring out how to parallelize the operation. In any case, do not parallelize hashing until your program does everything else you need it to do.

Also, I recommend using sqlite3. When I made my program store file hashes in a PostgreSQL database, the database insertions were a real bottleneck. Granted, I could have tried using COPY (I forget if I did or not), and I'm guessing that would have been reasonably fast.

If you use sqlite3 and perform the insertions in a BEGIN/COMMIT block, you're probably looking at about 10000 insertions per second in the presence of indexes. However, what you can do with the resulting database makes it all worthwhile. I did this with about 750000 files (85 GB). The whole insert and SHA1 hash operation took less than an hour, and it created a 140MB sqlite3 file. However, my query to find duplicate files and sort them by ID takes less than 20 seconds to run.

In summary, using a database is good, but note the insertion overhead. SHA1 is safer than MD5, but takes about 2.5x as much CPU power. However, I/O tends to be the bottleneck (CPU is a close second), so using MD5 instead of SHA1 really won't save you much time.
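To make the above concrete, here's a minimal sketch of the whole pipeline in Python's standard library: walk the tree, SHA-1 each file in chunks, do all the insertions inside a single transaction (the BEGIN/COMMIT point made above), then query for duplicate hashes. The table name and schema here are illustrative, not a prescription:

```python
import hashlib
import os
import sqlite3

def sha1_of_file(path, chunk_size=1 << 16):
    """Hash a file in chunks so large files never have to fit in memory."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def index_tree(root, db_path="hashes.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files (path TEXT PRIMARY KEY, hash TEXT)")
    # One transaction for all inserts -- this is what keeps sqlite3 fast.
    with conn:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?)",
                             (path, sha1_of_file(path)))
    # Paths sharing a hash are the duplicate candidates.
    dupes = conn.execute(
        "SELECT hash, GROUP_CONCAT(path) FROM files "
        "GROUP BY hash HAVING COUNT(*) > 1").fetchall()
    conn.close()
    return dupes
```

Each row returned pairs a hash with a comma-separated list of the paths that share it.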


5 Comments

how far along are you with your tool? I've been looking for a simple tool that does this for ages but couldn't find anything online beyond the obvious "compare two directories" shareware tools.
My program is already capable of loading file tree information into a database and hashing files; it works fabulously. I'm currently working on the problem of replacing duplicate files with hardlinks. Note that my program will probably only work on Linux and other Unix-like systems because it's tied to the stat structure filled in by the lstat() function.
Also, it has absolutely no frontend yet; you would have to paste in the path you want to scan, and for more complicated operations, learn how to work with Haskell code.
@JoeyAdams Hey I see this is kind of old, but wondered if you had published this anywhere? I would find it really useful (As would many people) and really don't want to write one up from scratch myself. Github?
@TolMera: Thanks for your interest. I might actually come back around to it soon, as I have a bunch of data I want to deduplicate. Though I was probably overengineering this; you could start with something like find -type f -exec sha1sum {} \; > hashes.txt and analyze the resulting text file.

You can use an MD5 or SHA-1 hash:

    function process_dir($path) {
        if ($handle = opendir($path)) {
            while (false !== ($file = readdir($handle))) {
                if ($file != "." && $file != "..") {
                    if (is_dir($path . "/" . $file)) {
                        process_dir($path . "/" . $file);
                    } else {
                        // You can change md5 to sha1.
                        // You can then put the hash into your database.
                        $hash = md5(file_get_contents($path . "/" . $file));
                    }
                }
            }
            closedir($handle);
        }
    }

If you're working on Windows, change the slashes to backslashes.
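One caveat with md5(file_get_contents(...)): it reads the entire file into memory before hashing. For large files it's better to hash incrementally. A sketch of chunked hashing in Python (the 64 KiB chunk size is an arbitrary choice):

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 16):
    """Hash a file incrementally so it never has to fit in memory at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The result is identical to hashing the whole file contents in one call, just without the memory spike.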



Here's a solution I figured out. I didn't do all of it in PHP, though it would be easy enough to do if you wanted:

    $fh = popen('find /home/admin -type f | xargs sha1sum', 'r');
    $files = array();
    while ($line = fgets($fh)) {
        // sha1sum separates the hash and the path with two spaces.
        list($hash, $file) = explode('  ', trim($line));

        $files[$hash][] = $file;
    }
    $dupes = array_filter($files, function($a) { return count($a) > 1; });

I realise I've not used databases here. How many files are you going to be indexing? Do you need to put that data into a database and then search for the dupes there?
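If you'd rather do the grouping outside PHP, the same idea can be sketched in Python. The group_by_hash helper below is hypothetical; it parses sha1sum-style output lines ("<hash>  <path>") that you'd capture from the same find | xargs sha1sum command:

```python
from collections import defaultdict

def group_by_hash(lines):
    """Group file paths by hash from sha1sum output ("<hash>  <path>" per line)."""
    files = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # sha1sum separates the hash and the path with two spaces.
        hash_, path = line.split("  ", 1)
        files[hash_].append(path)
    # Keep only hashes shared by more than one path: the duplicates.
    return {h: paths for h, paths in files.items() if len(paths) > 1}
```

Feed it the lines of a `hashes.txt` produced by `find ... | xargs sha1sum > hashes.txt` and it returns a dict mapping each duplicated hash to its list of paths.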

1 Comment

Thanks - I've written a script in the meantime that uses an sqlite DB.
