Compare binary files in C#

Question

I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.

I know that if the CRCs differ, the files are definitely different. However, if the CRCs match, the files are not necessarily identical. So I'm looking for an efficient way of comparing the two streams: one from the posted file and one from the file system.

I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.

Mehrdad Afshari · Accepted Answer · 2016-02-18 22:05:49Z

44

// Requires: using System.IO; using System.Linq;
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here; if they are equal...
    using (var file1 = new FileStream(fileName1, FileMode.Open, FileAccess.Read))
    using (var file2 = new FileStream(fileName2, FileMode.Open, FileAccess.Read))
        return FileStreamEquals(file1, file2);
}

static bool FileStreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize];
    byte[] buffer2 = new byte[bufferSize];
    while (true) {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        // Caveat (see comments): Stream.Read may return fewer bytes than requested,
        // so differing counts do not strictly prove that the contents differ.
        if (count1 != count2)
            return false;

        if (count1 == 0)
            return true;

        // You might replace the following with an efficient "memcmp"
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}

edited Feb 18, 2016 at 22:05

answered Jun 9, 2009 at 9:05

Mehrdad Afshari

12 Comments

Karata Over a year ago

Requiring count1 == count2 could be inaccurate, as Stream.Read is free to return a block shorter than the requested byte count. See msdn.microsoft.com/en-us/library/vstudio/…

Ozgur Ozturk Over a year ago

Thanks for the solution Mehrdad. Do you need the Take calls? I tried only if (!buffer1.SequenceEqual(buffer2)) and it seems to work.

Mehrdad Afshari Over a year ago

@Ozgur it works but it is less efficient and not very principled IMO.

Palec Over a year ago

The docs say it may be a problem even for FileStream. Do you mean it is usually not a problem? Or are the docs misleading?

Peter - Reinstate Monica Over a year ago

@Karata is absolutely right, and this faulty code should not stand uncorrected (somebody may use it on a piece of software I use later). At the very least FileStreamEquals should take two properly typed FileStream arguments; a weak case can probably be made that usually a Read request for n bytes from a file indeed reads n bytes if nothing went wrong. But would you bet your life (or your company) on every contingency? What about network mapped drives? Named pipes?
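
The short-read concern raised in these comments can be addressed by looping until each buffer is full or the stream ends. What follows is a minimal sketch rather than the answer author's code: the names StreamCompare, ReadFully and ContentEquals are hypothetical, the 64 KB chunk size is an arbitrary assumption, and it assumes a runtime where Span<T> is available (.NET Core 2.1 or later). It also uses the span-based SequenceEqual as one possible stand-in for the "memcmp" mentioned in the answer.

```csharp
using System;
using System.IO;

public static class StreamCompare
{
    // Hypothetical helper: keeps calling Read until the buffer is full
    // or the stream reports end-of-stream (Read returns 0).
    private static int ReadFully(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0)
                break;
            total += read;
        }
        return total;
    }

    public static bool ContentEquals(Stream stream1, Stream stream2)
    {
        const int bufferSize = 64 * 1024; // assumed chunk size, tune as needed
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = ReadFully(stream1, buffer1);
            int count2 = ReadFully(stream2, buffer2);

            // After ReadFully, a count mismatch can only mean different lengths.
            if (count1 != count2)
                return false;
            if (count1 == 0)
                return true;

            // Compare only the bytes that were actually read.
            if (!buffer1.AsSpan(0, count1).SequenceEqual(buffer2.AsSpan(0, count2)))
                return false;
        }
    }
}
```

Because the helper accepts any Stream, it should behave the same for a posted upload stream, a network share, or a local file, at the cost of one extra loop per chunk.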

Lars · Answer · 2010-04-14 14:44:35Z

23

I sped up the "memcmp" by using an Int64 comparison in a loop over the chunks read from the streams. This reduced the time to about a quarter.

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }

edited Apr 14, 2010 at 14:44

answered Apr 14, 2010 at 12:30

Lars

4 Comments

Pretzel Over a year ago

Is this advantageous only for 64-bit CPUs or will this help on 32-bit CPUs as well?

Lars Over a year ago

It should not matter whether you are running a 32-bit or a 64-bit operating system. But I never tried it on a pure 32-bit CPU. You have to try it and maybe just change Int64 to Int32. But aren't most more-or-less modern CPUs capable of 64-bit operations (x86 since 2004)? Go ahead and try it!

T.J. Crowder Over a year ago

See comments on this answer. Relying on count1 equalling count2 is not reliable.

Dan Bechard Over a year ago

Does this handle the last 0-7 bytes correctly? Did you test it on two files where fileSize % sizeof(Int64) > 0 and only the last byte is different?
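
One way to settle the question about the trailing 0-7 bytes is to compare whole 8-byte words first and then the remaining bytes one by one. The sketch below is an illustration, not Lars's code: ChunkCompare and ChunksEqual are hypothetical names, and it assumes count is no larger than either buffer. It keeps BitConverter.ToInt64 from the answer for the word comparison.

```csharp
using System;

public static class ChunkCompare
{
    // Compares the first 'count' bytes of two buffers: 8 bytes at a time
    // where possible, then byte by byte for the 0-7 byte tail.
    public static bool ChunksEqual(byte[] buffer1, byte[] buffer2, int count)
    {
        // Number of bytes covered by complete 8-byte words.
        int wordBytes = count - (count % sizeof(long));

        for (int i = 0; i < wordBytes; i += sizeof(long))
        {
            if (BitConverter.ToInt64(buffer1, i) != BitConverter.ToInt64(buffer2, i))
                return false;
        }

        // Tail: bytes beyond 'count' are never inspected, so stale data
        // left in the buffers from an earlier read cannot affect the result.
        for (int i = wordBytes; i < count; i++)
        {
            if (buffer1[i] != buffer2[i])
                return false;
        }

        return true;
    }
}
```

Inside a read loop this would be called with the number of bytes actually read for the current chunk.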

wpp · Answer · 2016-02-13 09:25:25Z

9

This is how I would do it if you don't want to rely on the CRC:

    /// <summary>
    /// Binary comparison of two files
    /// </summary>
    /// <param name="fileName1">the file to compare</param>
    /// <param name="fileName2">the other file to compare</param>
    /// <returns>a value indicating whether the files are identical</returns>
    public static bool CompareFiles(string fileName1, string fileName2)
    {
        FileInfo info1 = new FileInfo(fileName1);
        FileInfo info2 = new FileInfo(fileName2);
        bool same = info1.Length == info2.Length;
        if (same)
        {
            using (FileStream fs1 = info1.OpenRead())
            using (FileStream fs2 = info2.OpenRead())
            using (BufferedStream bs1 = new BufferedStream(fs1))
            using (BufferedStream bs2 = new BufferedStream(fs2))
            {
                for (long i = 0; i < info1.Length; i++)
                {
                    if (bs1.ReadByte() != bs2.ReadByte())
                    {
                        same = false;
                        break;
                    }
                }
            }
        }

        return same;
    }

edited Feb 13, 2016 at 9:25

wpp

answered Aug 23, 2013 at 14:27

JonPen

1 Comment

fbastian Over a year ago

info2 should be taking fileName2 as argument instead of fileName1. Otherwise, nice solution :-).

Larry · Answer · 2017-11-12 14:03:08Z

7

The accepted answer had an error that was pointed out but never corrected: Stream.Read calls are not guaranteed to return all of the bytes requested.

BinaryReader ReadBytes calls are guaranteed to return as many bytes as requested unless the end of the stream is reached first.

The following code takes advantage of BinaryReader to do the comparison:

    // Requires: using System.IO; using System.Linq;
    private static bool FileEquals(string file1, string file2)
    {
        using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (BinaryReader b1 = new BinaryReader(s1))
        using (BinaryReader b2 = new BinaryReader(s2))
        {
            while (true)
            {
                byte[] data1 = b1.ReadBytes(64 * 1024);
                byte[] data2 = b2.ReadBytes(64 * 1024);
                if (data1.Length != data2.Length)
                    return false;
                if (data1.Length == 0)
                    return true;
                if (!data1.SequenceEqual(data2))
                    return false;
            }
        }
    }

edited Nov 12, 2017 at 14:03

answered Nov 11, 2017 at 16:05

Larry

4 Comments

dafie Over a year ago

But, the downside of this is that it allocates both files in memory. Much more efficient is using multiple Read on FileStream to get missing bytes

Larry Over a year ago

@dafie No, it does not read entire files in memory as you seem to suggest. Yes there is some buffering but I think you'll find the code is very efficient, reading both files sequentially in 64k chunks.

dafie Over a year ago

@Larry I've done some benchmarking: pastebin.com/raw/ky9D8ynd

Larry Over a year ago

@dafie That is useful. My post was not intended to show an absolutely optimal method, just a simple one that averted the serious bug in a previous posting. If I interpret your post correctly, the ForceRead approach (which was not posted previously) is a bit more efficient.

albertjan · Answer · 2009-06-09 08:58:25Z

3

If you change that CRC to a SHA-1 signature, the chances of the files being different but having the same signature are astronomically small.

answered Jun 9, 2009 at 8:58

albertjan

6 Comments

Mehrdad Afshari Over a year ago

You should never rely on that in most serious apps. It's like just checking the hash in a hashtable lookup without comparing the actual keys!

Simon Farrow Over a year ago

unfortunately you can guarantee that the one time it messes up will be absolutely critical, probably that one big pitch.

albertjan Over a year ago

@Simon - hehe very true. @Mehrdad - No probably not but it would greatly reduce the times you'd have to check to be super uber sure.

kenny Over a year ago

Take the CRC and, say, the file size, and the chances are even smaller.

Maate Over a year ago

@MehrdadAfshari A rather serious app like git relies on exactly this. To quote Linus Torvalds, we will "quite likely never ever see it [a collision of two files by comparing SHAs] in the full history of the universe". Cf. stackoverflow.com/questions/9392365/….

Josh · Answer · 2009-06-09 09:04:14Z

You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.

But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32-bit PC, read 4 bytes at a time and compare them as Int32 values; on a 64-bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You would also probably want to use an unsafe code block so that you can use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int sizes.

You can use IntPtr.Size to determine the ideal size for the current processor architecture.
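
The cheap checks mentioned at the start of this answer can be sketched as a pre-filter that runs before any CRC or byte-by-byte work. This is only an illustration under assumptions: QuickFileCheck and TryDecide are hypothetical names, both files are assumed to exist, and timestamps are deliberately left out of the decision because equal dates do not prove equal content.

```csharp
using System;
using System.IO;

public static class QuickFileCheck
{
    // Returns false when the files certainly differ, true when both
    // arguments name the very same path, and null when the contents
    // still have to be compared (by CRC, hash, or byte by byte).
    // Note: FileInfo.Length throws if the file does not exist.
    public static bool? TryDecide(FileInfo file1, FileInfo file2)
    {
        // Different sizes can never hold identical content.
        if (file1.Length != file2.Length)
            return false;

        // Exactly the same full path is the same file.
        if (string.Equals(file1.FullName, file2.FullName, StringComparison.Ordinal))
            return true;

        // Same size, different paths: undecided.
        return null;
    }
}
```

A caller would fall through to a full comparison only when the result is null.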

Łukasz Nojek · Answer · 2023-07-26 16:28:11Z

I took the previous answers, and added the logic from the source code of BinaryReader.ReadBytes to get a solution that does not recreate buffer in every loop and does not suffer from unexpected return values from FileStream.Read:

public static bool AreSame(string path1, string path2) {
    const int BUFFER_SIZE = 64 * 1024;
    byte[] buffer1 = new byte[BUFFER_SIZE];
    byte[] buffer2 = new byte[BUFFER_SIZE];

    int ReadBytes(FileStream fs, byte[] buffer) {
        int totalBytes = 0;
        int count = buffer.Length;
        while (count > 0) {
            int readBytes = fs.Read(buffer, totalBytes, count);
            if (readBytes == 0)
                break;

            totalBytes += readBytes;
            count -= readBytes;
        }

        return totalBytes;
    }

    using (FileStream fs1 = new FileStream(path1, FileMode.Open, FileAccess.Read, FileShare.Read))
    using (FileStream fs2 = new FileStream(path2, FileMode.Open, FileAccess.Read, FileShare.Read)) {
        while (true) {
            int count1 = ReadBytes(fs1, buffer1);
            int count2 = ReadBytes(fs2, buffer2);

            if (count1 != count2)
                return false;

            if (count1 == 0)
                return true;

            if (count1 == BUFFER_SIZE) {
                if (!buffer1.SequenceEqual(buffer2))
                    return false;
            } else {
                if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
                    return false;
            }
        }
    }
}

Yotic · Answer · 2025-04-02 20:05:59Z

1

bool CompareBinaries(string path1, string path2)
{
    using var stream1 = new FileStream(path1, FileMode.Open, FileAccess.Read);
    using var stream2 = new FileStream(path2, FileMode.Open, FileAccess.Read);

    if (stream1.Length != stream2.Length)
        return false;
        
    return ReadChecksumFromBinary(stream1) == ReadChecksumFromBinary(stream2);
}

uint ReadChecksumFromBinary(Stream stream)
{
    // [[0x3C]  +  0x04       +  0x14               +  0x40    ]
    //  elfanew -> fileHeader -> fileOptionalHeader -> checksum
    return Read<uint>(Read<int>(0x3C) + 0x04 + 0x14 + 0x40);

    T Read<T>(int offset = 0) where T : unmanaged
    {
        Span<byte> buffer = stackalloc byte[sizeof(T)];
        stream.Position = offset;
        stream.Read(buffer);
        return **(T**)&buffer;
    }
}

A blazingly fast way to compare binary files (hopefully the stored checksum is a CRC32).
Works for all Windows binaries with a PE32+ header, such as .exe, .dll, .sys, etc.

answered Apr 2 at 20:05

Yotic

1 Comment

DiskJunky Aug 25 at 22:26

This only works for runtime (x86/x64 afaik) binaries, not arbitrary binary files.

Chizl · Answer · 2024-04-08 22:00:55Z

0

This is how I do it today with no loops. Hope this helps provide an alternative option.

public class FileCompare
{
    public bool IsFileSame(string filePath1, string filePath2) => 
        IsFileSame(new FileInfo(filePath1), new FileInfo(filePath2));

    public bool IsFileSame(FileInfo filePath1, FileInfo filePath2)
    {
        var retVal = false;

        if (filePath1.Exists && 
            filePath2.Exists && 
            filePath1.Length == filePath2.Length)
        {
            using (FileStream inputStream1 = File.OpenRead(filePath1.FullName))
            {
                using (FileStream inputStream2 = File.OpenRead(filePath2.FullName))
                {
                    using (MD5 mD = MD5.Create())
                    {
                        retVal = BitConverter.ToString(mD.ComputeHash(inputStream1))
                            .Equals(BitConverter.ToString(mD.ComputeHash(inputStream2)));
                    }
                }
            }
        }

        return retVal;
    }
}

answered Apr 8, 2024 at 22:00

Chizl

1 Comment

DiskJunky Aug 25 at 22:22

The industry standard is to deprecate MD5 going forward. The solution works but is weak against collisions that can be computed with modern hardware.
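
Following up on the MD5 concern, the same idea might be sketched with SHA-256 swapped in for MD5. This is an assumption-laden illustration, not the answer author's code: HashCompare and HaveSameSha256 are hypothetical names, Span<T> support (.NET Core 2.1 or later) is assumed, and matching hashes still only make equality overwhelmingly likely rather than proven.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

public static class HashCompare
{
    // Compares two files by length first, then by SHA-256 digest.
    public static bool HaveSameSha256(string path1, string path2)
    {
        using (FileStream stream1 = File.OpenRead(path1))
        using (FileStream stream2 = File.OpenRead(path2))
        using (SHA256 sha256 = SHA256.Create())
        {
            // Cheap early exit: different lengths cannot be equal.
            if (stream1.Length != stream2.Length)
                return false;

            // ComputeHash consumes each stream to its end.
            byte[] hash1 = sha256.ComputeHash(stream1);
            byte[] hash2 = sha256.ComputeHash(stream2);

            // Compare the raw digest bytes instead of formatted strings.
            return hash1.AsSpan().SequenceEqual(hash2);
        }
    }
}
```

Comparing the digest bytes directly avoids building two hex strings, and a stored digest could play the role of the pre-calculated CRC32 from the question.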
