Search longest pattern in byte array in C#

Question

I need to write effective and quick method to search byte array for given pattern. I write it this way, what do you think , how to improve? And it has one bug, it cannot return match with length 1.

public static bool SearchByteByByte(byte[] bytes, byte[] pattern)
    {
        bool found = false;
        int matchedBytes = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
            {
                for (int j = 1; j < pattern.Length; j++)
                {
                    if (bytes[i + j] == pattern[j])
                    {
                        matchedBytes++;
                        if (matchedBytes == pattern.Length - 1)
                        {
                            return true;
                        }
                        continue;
                    }
                    else
                    {
                        matchedBytes = 0;
                        break;
                    }
                }
            }
        }
        return found;
    }

Any suggestions ?

You might look at string search algorithms en.wikipedia.org/wiki/String_searching_algorithm — mdakin
– mdakin, Commented Mar 27, 2012 at 12:21
It would be nice to include the idea behind your algorithm rather than pasting your code without explanation. — ntziolis
– ntziolis, Commented Mar 27, 2012 at 12:23
Would converting your two bytes arrays to two comma-separated-Hexa-strings and do a regex matching be effective enough ? — jbl
– jbl, Commented Mar 27, 2012 at 13:58

markmuetz · Accepted Answer · 2012-03-27 19:19:12Z

8

The Boyer-Moore algorithm that is used in grep is pretty efficient, and gets more efficient for longer pattern sizes. I'm pretty sure you could make it work for a byte array without too much difficulty, and its wikipedia page has an implementation in Java that should be fairly easy to port to C#.

UPDATE:

Here's an implementation of a simplified version of the Boyer-Moore algorithm for byte arrays in C#. It only uses the second jump table of the full algorithm. Based on the array sizes that you said (haystack: 2000000 bytes, needle: 10 bytes), it's about 5-8 times faster than a simple byte by byte algorithm.

    static int SimpleBoyerMooreSearch(byte[] haystack, byte[] needle)
    {
        int[] lookup = new int[256];
        for (int i = 0; i < lookup.Length; i++) { lookup[i] = needle.Length; }

        for (int i = 0; i < needle.Length; i++)
        {
            lookup[needle[i]] = needle.Length - i - 1;
        }

        int index = needle.Length - 1;
        var lastByte = needle.Last();
        while (index < haystack.Length)
        {
            var checkByte = haystack[index];
            if (haystack[index] == lastByte)
            {
                bool found = true;
                for (int j = needle.Length - 2; j >= 0; j--)
                {
                    if (haystack[index - needle.Length + j + 1] != needle[j])
                    {
                        found = false;
                        break;
                    }
                }

                if (found)
                    return index - needle.Length + 1;
                else
                    index++;
            }
            else
            {
                index += lookup[checkByte];
            }
        }
        return -1;
    }

edited Mar 27, 2012 at 19:19

answered Mar 27, 2012 at 13:08

markmuetz

9,6542 gold badges34 silver badges38 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Hlavson Over a year ago

I have read, that The Boyer-Moore algorithm is not so fast on big alphabet. Byte alphabet is 256. It will be fast?

markmuetz Over a year ago

To be honest, I'm not sure. You can try benchmarking it but I'm afraid I can't tell you without benchmarking it myself. Out of interest, how long is your pattern?

Hlavson Over a year ago

About 10 bytes, but array where i need to search is about 2 000 000 items long.

Hlavson Over a year ago

Could be this code modified with one's more input parameter? Index of haystack from where to begin search? I know i can copy array and remove first item, but in large array it's slow, so start search from given index will be better.

markmuetz Over a year ago

Yes, very easily. Just add an int index argument to the method, so as its signature looks like this: static int SimpleBoyerMooreSearch(byte[] haystack, byte[] needle, int index), and delete the int index assignment in the method body. Then increase the index by needle.Length - 1, which is necessary due to the way the algorithm works (i.e. starting at the end of the needle and working back).

Sergey Berezovskiy · Accepted Answer · 2012-03-27 12:39:06Z

2

And it has one bug, it cannot return match with length 1

To fix this, start inner loop from zero:

public static bool SearchByteByByte(byte[] bytes, byte[] pattern)
{
    bool found = false;
    int matchedBytes = 0;
    for (int i = 0; i < bytes.Length; i++)
    {

        if (pattern[0] == bytes[i] && bytes.Length - i >= pattern.Length)
        {

            for (int j = 0; j < pattern.Length; j++) // start from 0
            {
                if (bytes[i + j] == pattern[j]) 
                {
                    matchedBytes++;
                    if (matchedBytes == pattern.Length) // remove - 1                    
                        return true;

                    continue;
                }
                else
                {
                    matchedBytes = 0;
                    break;
                }
            }
        }
    }
    return found;
}

UPDATE: Here is your searching algorithm after flattering and removing local variables (they are not needed)

public static bool SearchByteByByte(byte[] bytes, byte[] pattern)
{
    for (int i = 0; i < bytes.Length; i++)
    {
        if (bytes.Length - i < pattern.Length)
            return false;

        if (pattern[0] != bytes[i])
            continue;

        for (int j = 0; j < pattern.Length; j++)
        {
            if (bytes[i + j] != pattern[j])
                break;                    

            if (j == pattern.Length - 1)
                return true;
        }
    }

    return false;
}

edited Mar 27, 2012 at 12:39

answered Mar 27, 2012 at 12:26

Sergey Berezovskiy

237k44 gold badges441 silver badges468 bronze badges

2 Comments

Sergey Berezovskiy Over a year ago

Btw you need to check pattern array length before getting first item of array.

Sergey Berezovskiy Over a year ago

No problem. If you have any questions about last optimization, ask them :)

AKX · Accepted Answer · 2012-03-27 12:37:28Z

0

So you're looking, effectively, for the longest common substring, so see the Wikipedia article on that: http://en.wikipedia.org/wiki/Longest_common_substring_problem

... or even a reference implementation: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Longest_common_substring#C.23 -- you will, of course, have to substitute byte[] for string there, etc.

answered Mar 27, 2012 at 12:37

AKX

171k16 gold badges146 silver badges228 bronze badges

2 Comments

Hlavson Over a year ago

I tried Longest common substring algorithm for strings, but it is slow. I think that byte algoritm will be faster. In my algorithm i must work with bytes.

AKX Over a year ago

Strings in the classical definition (eschewing Unicode, with only 8-bit "characters") are just byte arrays anyway.

Collectives™ on Stack Overflow

Search longest pattern in byte array in C#

3 Answers 3

5 Comments

2 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

5 Comments

2 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related