35

I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.

I know that if the CRCs differ, the files are definitely different. However, if the CRCs match, I still can't be sure the files are identical. So I'm looking for a nice, efficient way of comparing the two streams: one from the posted file and one from the file system.

I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.

9 Answers

44
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here; if they are equal...
    // Open both files read-only so that write access is not required.
    using (var file1 = new FileStream(fileName1, FileMode.Open, FileAccess.Read))
    using (var file2 = new FileStream(fileName2, FileMode.Open, FileAccess.Read))
        return FileStreamEquals(file1, file2);
}

static bool FileStreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize]; //buffer size
    byte[] buffer2 = new byte[bufferSize];
    while (true) {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        if (count1 != count2)
            return false;

        if (count1 == 0)
            return true;

        // You might replace the following with an efficient "memcmp"
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}

12 Comments

Requiring count1 == count2 could be inaccurate, as Stream.Read is free to return a block shorter than the requested byte count. See msdn.microsoft.com/en-us/library/vstudio/…
Thanks for the solution Mehrdad. Do you need the Take calls? I tried only if (!buffer1.SequenceEqual(buffer2)) and it seems to work.
@Ozgur it works but it is less efficient and not very principled IMO.
The docs say it may be a problem even for FileStream. Do you mean it is usually not a problem? Or are the docs misleading?
@Karata is absolutely right, and this faulty code should not stand uncorrected (somebody may use it on a piece of software I use later). At the very least FileStreamEquals should take two properly typed FileStream arguments; a weak case can probably be made that usually a Read request for n bytes from a file indeed reads n bytes if nothing went wrong. But would you bet your life (or your company) on every contingency? What about network mapped drives? Named pipes?
23

I sped up the "memcmp" by using an Int64 comparison in a loop over the chunks read from the streams. This reduced the time to about 1/4.

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }

4 Comments

Is this advantageous only for 64-bit CPUs or will this help on 32-bit CPUs as well?
It should not matter whether you are running a 32-bit or 64-bit operating system, but I never tried it on a pure 32-bit CPU. You would have to try it, and maybe just change Int64 to Int32. But aren't most reasonably modern CPUs capable of 64-bit operations (x86 since 2004)? Go ahead and try it!
See comments on this answer. Relying on count1 equalling count2 is not reliable.
Does this handle the last 0-7 bytes correctly? Did you test it on two files where fileSize % sizeof(Int64) > 0 and only the last byte is different?
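Both concerns raised in these comments (short reads and the trailing 0-7 bytes) can be sidestepped with a variation like the following. This is only a sketch, not the answerer's code: `StreamComparer` and `ReadFully` are hypothetical names introduced here, the 64 KB buffer size is an arbitrary choice, and it assumes a runtime where `Span<T>` is available (.NET Core 2.1 or later, or the System.Memory package). `ReadFully` keeps calling `Stream.Read` until the buffer is full or the stream ends, and the span-based `SequenceEqual` compares exactly the bytes that were read.

```csharp
using System;
using System.IO;

static class StreamComparer
{
    // Hypothetical helper: keeps reading until the buffer is full or the
    // stream ends, because Stream.Read may return fewer bytes than requested.
    static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                break; // end of stream
            total += read;
        }
        return total;
    }

    public static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 64 * 1024; // arbitrary size chosen for this sketch
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = ReadFully(stream1, buffer1);
            int count2 = ReadFully(stream2, buffer2);

            // With full reads, differing counts mean differing stream lengths.
            if (count1 != count2)
                return false;

            if (count1 == 0)
                return true;

            // Compare only the bytes actually read in this round.
            if (!buffer1.AsSpan(0, count1).SequenceEqual(buffer2.AsSpan(0, count2)))
                return false;
        }
    }
}
```

Because only `count1` bytes are compared, the tail of a file whose length is not a multiple of 8 needs no special handling.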
9

This is how I would do it if you don't want to rely on the CRC:

    /// <summary>
    /// Binary comparison of two files
    /// </summary>
    /// <param name="fileName1">the file to compare</param>
    /// <param name="fileName2">the other file to compare</param>
    /// <returns>a value indicating whether the files are identical</returns>
    public static bool CompareFiles(string fileName1, string fileName2)
    {
        FileInfo info1 = new FileInfo(fileName1);
        FileInfo info2 = new FileInfo(fileName2);
        bool same = info1.Length == info2.Length;
        if (same)
        {
            using (FileStream fs1 = info1.OpenRead())
            using (FileStream fs2 = info2.OpenRead())
            using (BufferedStream bs1 = new BufferedStream(fs1))
            using (BufferedStream bs2 = new BufferedStream(fs2))
            {
                for (long i = 0; i < info1.Length; i++)
                {
                    if (bs1.ReadByte() != bs2.ReadByte())
                    {
                        same = false;
                        break;
                    }
                }
            }
        }

        return same;
    }

1 Comment

info2 should be taking fileName2 as argument instead of fileName1. Otherwise, nice solution :-).
7

The accepted answer had an error that was pointed out, but never corrected: stream read calls are not guaranteed to return all bytes requested.

BinaryReader ReadBytes calls are guaranteed to return as many bytes as requested unless the end of the stream is reached first.

The following code takes advantage of BinaryReader to do the comparison:

    static private bool FileEquals(string file1, string file2)
    {
        using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (BinaryReader b1 = new BinaryReader(s1))
        using (BinaryReader b2 = new BinaryReader(s2))
        {
            while (true)
            {
                byte[] data1 = b1.ReadBytes(64 * 1024);
                byte[] data2 = b2.ReadBytes(64 * 1024);
                if (data1.Length != data2.Length)
                    return false;
                if (data1.Length == 0)
                    return true;
                if (!data1.SequenceEqual(data2))
                    return false;
            }
        }
    }

4 Comments

But the downside of this is that it allocates both files in memory. It is much more efficient to call Read on the FileStream repeatedly to get the missing bytes.
@dafie No, it does not read entire files in memory as you seem to suggest. Yes there is some buffering but I think you'll find the code is very efficient, reading both files sequentially in 64k chunks.
I've done some benchmarking: pastebin.com/raw/ky9D8ynd
@dafie That is useful. My post was not intended to show an absolutely optimal method, just a simple one that averted the serious bug in a previous posting. If I interpret your post correctly, the ForceRead approach (which was not posted previously) is a bit more efficient.
3

If you change that CRC to a SHA-1 signature, the chances of two different files having the same signature are astronomically small.
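A minimal sketch of that idea follows, under a few assumptions: it swaps in SHA-256 rather than SHA-1, it assumes the database stores the digest as a hexadecimal string, and `HashCompare`, `HashStream`, and `MatchesStoredHash` are hypothetical names introduced for illustration. `ComputeHash(Stream)` consumes the stream incrementally, so the file is not loaded into memory as a whole.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class HashCompare
{
    // Hypothetical helper: returns the SHA-256 digest of the stream's
    // remaining content as an uppercase hex string without separators.
    public static string HashStream(Stream stream)
    {
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "");
        }
    }

    // Assumption: the digest stored alongside the file is a hex string.
    public static bool MatchesStoredHash(Stream posted, string storedHex) =>
        string.Equals(HashStream(posted), storedHex, StringComparison.OrdinalIgnoreCase);
}
```

A matching digest still does not prove equality in the strict sense, which is the objection raised in the comments below; it only makes an undetected difference extremely unlikely.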

6 Comments

You should never rely on that in most serious apps. It's like just checking the hash in a hashtable lookup without comparing the actual keys!
Unfortunately, you can guarantee that the one time it messes up will be absolutely critical, probably during that one big pitch.
@Simon - hehe very true. @Mehrdad - No probably not but it would greatly reduce the times you'd have to check to be super uber sure.
Take the CRC and, say, the file size, and the chances are even smaller.
@MehrdadAfshari A rather serious app like git relies on exactly this. To quote Linus Torvalds, we will "quite likely never ever see it [a collision of two files when comparing SHAs] in the full history of the universe". Cf. stackoverflow.com/questions/9392365/….
3

You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.

But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32-bit PC, read 4 bytes at a time and compare them as Int32 values; on a 64-bit PC, read 8 bytes at a time. This can be roughly 4 or 8 times as fast as comparing byte by byte. You would probably also want to use an unsafe code block so that you can use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int size.

You can use IntPtr.Size to determine the ideal size for the current processor architecture.
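The stride idea can be sketched without an unsafe block by reinterpreting the byte buffers as 64-bit words. This is an illustration under stated assumptions, not code from the answer: `WordCompare` and `BlocksEqual` are hypothetical names, the word size is fixed at 8 bytes instead of being chosen from `IntPtr.Size`, and it assumes `Span<T>` and `MemoryMarshal.Cast` are available (.NET Core 2.1 or later). The cast may produce unaligned reads, which is fine on x86/x64 but may be slower or problematic on other architectures.

```csharp
using System;
using System.Runtime.InteropServices;

static class WordCompare
{
    // Hypothetical helper: compares two blocks 8 bytes at a time, then
    // finishes the remaining 0-7 bytes one by one.
    public static bool BlocksEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
    {
        if (a.Length != b.Length)
            return false;

        // Largest prefix whose length is a multiple of 8 bytes.
        int wordBytes = a.Length - (a.Length % sizeof(ulong));

        // Reinterpret that prefix as 64-bit words without copying.
        ReadOnlySpan<ulong> wordsA = MemoryMarshal.Cast<byte, ulong>(a.Slice(0, wordBytes));
        ReadOnlySpan<ulong> wordsB = MemoryMarshal.Cast<byte, ulong>(b.Slice(0, wordBytes));

        for (int i = 0; i < wordsA.Length; i++)
        {
            if (wordsA[i] != wordsB[i])
                return false;
        }

        // Compare the tail that does not fill a whole word.
        for (int i = wordBytes; i < a.Length; i++)
        {
            if (a[i] != b[i])
                return false;
        }

        return true;
    }
}
```

Whether this beats the built-in span `SequenceEqual` depends on the runtime, so it is worth measuring before adopting it.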

2

I took the previous answers and added the logic from the source code of BinaryReader.ReadBytes to get a solution that does not reallocate the buffers on every loop iteration and does not suffer from FileStream.Read returning fewer bytes than requested:

public static bool AreSame(string path1, string path2) {
    int BUFFER_SIZE = 64 * 1024;
    byte[] buffer1 = new byte[BUFFER_SIZE];
    byte[] buffer2 = new byte[BUFFER_SIZE];

    int ReadBytes(FileStream fs, byte[] buffer) {
        int totalBytes = 0;
        int count = buffer.Length;
        while (count > 0) {
            int readBytes = fs.Read(buffer, totalBytes, count);
            if (readBytes == 0)
                break;

            totalBytes += readBytes;
            count -= readBytes;
        }

        return totalBytes;
    }

    using (FileStream fs1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (FileStream fs2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read)) {
        while (true) {
            int count1 = ReadBytes(fs1, buffer1);
            int count2 = ReadBytes(fs2, buffer2);

            if (count1 != count2)
                return false;

            if (count1 == 0)
                return true;

            if (count1 == BUFFER_SIZE) {
                if (!buffer1.SequenceEqual(buffer2))
                    return false;
            } else {
                if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
                    return false;
            }
        }
    }
}

1
bool CompareBinaries(string path1, string path2)
{
    using var stream1 = new FileStream(path1, FileMode.Open, FileAccess.Read);
    using var stream2 = new FileStream(path2, FileMode.Open, FileAccess.Read);

    if (stream1.Length != stream2.Length)
        return false;
        
    return ReadChecksumFromBinary(stream1) == ReadChecksumFromBinary(stream2);
}

uint ReadChecksumFromBinary(Stream stream)
{
    // [[0x3C]  +  0x04       +  0x14               +  0x40    ]
    //  elfanew -> fileHeader -> fileOptionalHeader -> checksum
    return Read<uint>(Read<int>(0x3C) + 0x04 + 0x14 + 0x40);

    T Read<T>(int offset = 0) where T : unmanaged
    {
        Span<byte> buffer = stackalloc byte[sizeof(T)];
        stream.Position = offset;
        stream.Read(buffer);
        return **(T**)&buffer;
    }
}

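A blazingly fast way to compare Windows binaries: instead of reading the whole file, it compares the checksum stored in the PE header (hopefully something like a CRC32).
It works for any Windows binary with a PE32+ header, such as .exe, .dll, .sys, etc.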

1 Comment

This only works for runtime (x86/x64 afaik) binaries, not arbitrary binary files.
0

This is how I do it today with no loops. Hope this helps provide an alternative option.

public class FileCompare
{
    public bool IsFileSame(string filePath1, string filePath2) => 
        IsFileSame(new FileInfo(filePath1), new FileInfo(filePath2));

    public bool IsFileSame(FileInfo filePath1, FileInfo filePath2)
    {
        var retVal = false;

        if (filePath1.Exists && 
            filePath2.Exists && 
            filePath1.Length == filePath2.Length)
        {
            using (FileStream inputStream1 = File.OpenRead(filePath1.FullName))
            {
                using (FileStream inputStream2 = File.OpenRead(filePath2.FullName))
                {
                    using (MD5 mD = MD5.Create())
                    {
                        retVal = BitConverter.ToString(mD.ComputeHash(inputStream1))
                            .Equals(BitConverter.ToString(mD.ComputeHash(inputStream2)));
                    }
                }
            }
        }

        return retVal;
    }
}

1 Comment

The industry standard is to deprecate MD5 going forward. The solution works, but MD5 is weak against collisions that can be computed with modern hardware.