5

I need to write effective and quick method to search byte array for given pattern. I write it this way, what do you think , how to improve? And it has one bug, it cannot return match with length 1.

public static bool SearchByteByByte(byte[] bytes, byte[] pattern)
    {
        bool found = false;
        int matchedBytes = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
            {
                for (int j = 1; j < pattern.Length; j++)
                {
                    if (bytes[i + j] == pattern[j])
                    {
                        matchedBytes++;
                        if (matchedBytes == pattern.Length - 1)
                        {
                            return true;
                        }
                        continue;
                    }
                    else
                    {
                        matchedBytes = 0;
                        break;
                    }
                }
            }
        }
        return found;
    }

Any suggestions ?

3
  • You might look at string search algorithms en.wikipedia.org/wiki/String_searching_algorithm Commented Mar 27, 2012 at 12:21
  • 3
    It would be nice to include the idea behind your algorithm rather than pasting your code without explanation. Commented Mar 27, 2012 at 12:23
  • Would converting your two bytes arrays to two comma-separated-Hexa-strings and do a regex matching be effective enough ? Commented Mar 27, 2012 at 13:58

3 Answers 3

8

The Boyer-Moore algorithm that is used in grep is pretty efficient, and gets more efficient for longer pattern sizes. I'm pretty sure you could make it work for a byte array without too much difficulty, and its wikipedia page has an implementation in Java that should be fairly easy to port to C#.

UPDATE:

Here's an implementation of a simplified version of the Boyer-Moore algorithm for byte arrays in C#. It only uses the second jump table of the full algorithm. Based on the array sizes that you said (haystack: 2000000 bytes, needle: 10 bytes), it's about 5-8 times faster than a simple byte by byte algorithm.

    static int SimpleBoyerMooreSearch(byte[] haystack, byte[] needle)
    {
        int[] lookup = new int[256];
        for (int i = 0; i < lookup.Length; i++) { lookup[i] = needle.Length; }

        for (int i = 0; i < needle.Length; i++)
        {
            lookup[needle[i]] = needle.Length - i - 1;
        }

        int index = needle.Length - 1;
        var lastByte = needle.Last();
        while (index < haystack.Length)
        {
            var checkByte = haystack[index];
            if (haystack[index] == lastByte)
            {
                bool found = true;
                for (int j = needle.Length - 2; j >= 0; j--)
                {
                    if (haystack[index - needle.Length + j + 1] != needle[j])
                    {
                        found = false;
                        break;
                    }
                }

                if (found)
                    return index - needle.Length + 1;
                else
                    index++;
            }
            else
            {
                index += lookup[checkByte];
            }
        }
        return -1;
    }
Sign up to request clarification or add additional context in comments.

5 Comments

I have read, that The Boyer-Moore algorithm is not so fast on big alphabet. Byte alphabet is 256. It will be fast?
To be honest, I'm not sure. You can try benchmarking it but I'm afraid I can't tell you without benchmarking it myself. Out of interest, how long is your pattern?
About 10 bytes, but array where i need to search is about 2 000 000 items long.
Could be this code modified with one's more input parameter? Index of haystack from where to begin search? I know i can copy array and remove first item, but in large array it's slow, so start search from given index will be better.
Yes, very easily. Just add an int index argument to the method, so as its signature looks like this: static int SimpleBoyerMooreSearch(byte[] haystack, byte[] needle, int index), and delete the int index assignment in the method body. Then increase the index by needle.Length - 1, which is necessary due to the way the algorithm works (i.e. starting at the end of the needle and working back).
2

And it has one bug, it cannot return match with length 1

To fix this, start inner loop from zero:

public static bool SearchByteByByte(byte[] bytes, byte[] pattern)
{
    bool found = false;
    int matchedBytes = 0;
    for (int i = 0; i < bytes.Length; i++)
    {

        if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
        {

            for (int j = 0; j < pattern.Length; j++) // start from 0
            {
                if (bytes[i + j] == pattern[j]) 
                {
                    matchedBytes++;
                    if (matchedBytes == pattern.Length) // remove - 1                    
                        return true;

                    continue;
                }
                else
                {
                    matchedBytes = 0;
                    break;
                }
            }
        }
    }
    return found;
}

UPDATE: Here is your searching algorithm after flattering and removing local variables (they are not needed)

public static bool SearchByteByByte(byte[] bytes, byte[] pattern)
{
    for (int i = 0; i < bytes.Length; i++)
    {
        if (bytes.Length - i < pattern.Length)
            return false;

        if (pattern[0] != bytes[i])
            continue;

        for (int j = 0; j < pattern.Length; j++)
        {
            if (bytes[i + j] != pattern[j])
                break;                    

            if (j == pattern.Length - 1)
                return true;
        }
    }

    return false;
}

2 Comments

Btw you need to check pattern array length before getting first item of array.
No problem. If you have any questions about last optimization, ask them :)
0

So you're looking, effectively, for the longest common substring, so see the Wikipedia article on that: http://en.wikipedia.org/wiki/Longest_common_substring_problem

... or even a reference implementation: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#C.23 -- you will, of course, have to substitute byte[] for string there, etc.

2 Comments

I tried Longest common substring algorithm for strings, but it is slow. I think that byte algoritm will be faster. In my algorithm i must work with bytes.
Strings in the classical definition (eschewing Unicode, with only 8-bit "characters") are just byte arrays anyway.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.