2

I have a set of hexadecimal numbers that I would like to search a file, which each time this matches it stores the last offset address of the last byte found in the pattern from the text file.

So for example, the pattern is '04 00 03 00 00 00 04 00 04 00 00 00 08 00' then each time this was found in the file then the last byte in the pattern which is '00' the offset is read in and stored in an array.

Any help on this please, this has been driving me mad for months now.

Thanks


Thanks Svick It did work indeed ... It returned all the matches which were the offsets I needed to find.

However, since I added little a bit of my code in, it stops on the first match and doesn't cycle through... Could someone please point out what is wrong and why it is stopping at the first match found

Many thanks

3
  • 1
    You should really post some relevant specs and code for more detail. Commented Jul 2, 2011 at 22:39
  • I think you should ask a new question for this. But I have no idea what your new code does and it doesn't seem to be related to your original question: I don't see any pattern searching there. Commented Jul 6, 2011 at 10:27
  • Read my answer to problem in this question. Commented Jun 27, 2016 at 8:28

3 Answers 3

3

I'm assuming you get the pattern as a string. What you should do is to convert it into an array of bytes.

If the file to search is relatively small and efficiency is not very important you could do it simply by starting search from each byte in the file and compare with the pattern byte by byte. If you go through the whole pattern and all bytes match, you store the position in your result array. In code, it would go something like this:

byte[] pattern = …;
byte[] file = …;

var result = Enumerable.Range(0, file.Length - pattern.Length + 1)
                       .Where(i => pattern.Select((b, j) => new { j, b })
                                          .All(p => file[i + p.j] == p.b))
                       .Select(i => i + pattern.Length - 1);

The time complexity of this solution is O(HN), where H is the length of the file (“haystack”) and N is the length of the pattern (“needle”).

If the file is big or if you need it fast, you should use the KMP algorithm, which has the complexity of O(H + N).

Sign up to request clarification or add additional context in comments.

3 Comments

This is what I have done with your code .... 'code' byte[] pattern = new byte[5] { 00, 00, 00, 08, 00 }; byte[] file = File.ReadAllBytes("C:\\123.bin"); var result = Enumerable.Range(0, file.Length - pattern.Length + 1) .Where(i => pattern.Select((b, j) => new { j, b }) .All(p => file[i + p.j] == p.b)) .Select(i => i + pattern.Length - 1); Console.WriteLine(result);'code' However, all it outputs is Enumerable.Range(0, file.Length - pattern.Length + 1)
Your code should output System.Linq.Enumerable+WhereSelectEnumerableIterator`2[System.Int32,System.Int32]. If you want to see the items in the resulting sequence, use foreach.
Sorry to bother you ... Is this correct codeforeach (var value in result) { Console.WriteLine(value); // Write the value. Console.ReadKey(); }code
2

Seems like you have several steps:

  1. Reading the pattern (needle) from your text file
  2. Converting the pattern from hexadecimal ASCII characters into an array of bytes
  3. Reading your binary file (haystack) into memory
  4. Finding the needle in the haystack

If the entire haystack fits into memory at once, this is quite simple. std::find and memcmp can take care of step 4 (oops, those are C++) C# strings can contain NUL bytes, so you should just be able to use the string IndexOf function here. You'll be given the offset of the beginning of the pattern, simply add the length of the pattern (minus one) to get the offset of the end byte. Or just write the search yourself with a couple for loops. See @svick's excellent answer for a discussion of step #4.

To handle larger haystack files, process a block at a time, make sure to overlap by the length of the pattern minus one, otherwise you could miss a match that crosses block boundaries.

2 Comments

Be careful with string.IndexOf(). By default it compares the strings using the default culture, which might have unforeseen consequences when used with strings that are meant to represent arrays of bytes.
@svick: Yeah, it's unfortunate that .NET has Buffer.BlockCopy but no BlockCompare or BlockFind type methods.
2

My idea was to use a Queue to "seek" through the input.

static void Main(string[] args)
{
    byte[] pattern = new byte[] { 3, 2, 1 };
    byte[] data = new byte[] { 1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4, 5, 4, 3, 2, 1 };
    foreach (long offset in FindPattern(pattern, data))
    {
        Console.WriteLine(offset);
    }

    Console.ReadLine();
}

public static IEnumerable<long> FindPattern(byte[] pattern, byte[] data)
{
    Queue<byte> queue = new Queue<byte>(pattern.Length);
    using (MemoryStream input = new MemoryStream(data))
    using (BinaryReader reader = new BinaryReader(input))
    {
        byte[] buffer = new byte[1];
        while (1 == reader.Read(buffer, 0, 1))
        {
            if (queue.Count == pattern.Length)
            {
                queue.Dequeue();
            }

            queue.Enqueue(buffer[0]);
            if (Matches(queue, pattern))
            {
                // The input is positioned after the last read byte, which
                // completed the pattern.
                yield return input.Position;
            }
        }
    }
}

private static bool Matches(Queue<byte> data, byte[] pattern)
{
    return data.SequenceEqual(pattern);
}

I'm sure one use a custom "Queue" that performs a better comparison than SequenceEquals(), but you get the idea.

2 Comments

Why do you call ToArray() on the queue when comparing? All it does is to add some overhead.
Just a mistake, I just wrote that for the answer, it isn't actually used code.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.