6

I've got a very long array of bytes, for example:

Byte[] bytes = {90, 80, 63, 65, 70 ...};

It's nearly 20-30 Mb (theoretically). Is there a fast way to check if this array contains another array, for example:

Byte[] small = {63, 80, 75, 77};

First, I need find bytes in that order which they was defined in small array. Second, I need find array in another array not any byte of small array. Thanks all to advance.

6
  • 1
    Can you be more specific? Do you mean the exact sequence or those numbers, in any order in any place? Commented Jan 26, 2016 at 7:27
  • 1
    Possible duplicate of Check whether an array is a subset of another Commented Jan 26, 2016 at 7:27
  • Are you looking for the sequence, i.e. that those four bytes must occur in that order? If so, this is generally known as substring search (even when it doesn't involve strings in whatever language you're working in), and you should be able to look up quite a few algorithms for it. Commented Jan 26, 2016 at 7:27
  • Do you want them to show up in the same order? If order doesn't matter, then you can probably sort your initial array and then binary search for candidates. If the order matters, then there are various algorithms that can help you out. Off the top of my head Aho Corasick might be the fastest in your case. Commented Jan 26, 2016 at 7:29
  • You can try IsSubsetOf method as explained in this answer. Commented Jan 26, 2016 at 7:29

6 Answers 6

5

For performance you'll want something like the Boyer-Moore string search algorithm. While it's designed for strings, it should work just as well on byte arrays, and is much more performant than a brute-force solution.

The Wikipedia article provides several implementations, including one in Java and one in C, so creating a C# implementation should be fairly painless.


As it turns out, translating the Wikipedia article's Java implementation of Boyer-Moore to C# (and char to byte) was painless indeed. Here it is:

public static class BoyerMoore
{
    public static int IndexOf(byte[] haystack, byte[] needle)
    {
        if (needle.Length == 0)
        {
            return 0;
        }

        int[] charTable = MakeCharTable(needle);
        int[] offsetTable = MakeOffsetTable(needle);
        for (int i = needle.Length - 1; i < haystack.Length;)
        {
            int j;
            for (j = needle.Length - 1; needle[j] == haystack[i]; --i, --j)
            {
                if (j == 0)
                {
                    return i;
                }
            }

            i += Math.Max(offsetTable[needle.Length - 1 - j], charTable[haystack[i]]);
        }

        return -1;
    }

    private static int[] MakeCharTable(byte[] needle)
    {
        const int ALPHABET_SIZE = 256;
        int[] table = new int[ALPHABET_SIZE];
        for (int i = 0; i < table.Length; ++i)
        {
            table[i] = needle.Length;
        }

        for (int i = 0; i < needle.Length - 1; ++i)
        {
            table[needle[i]] = needle.Length - 1 - i;
        }

        return table;
    }

    private static int[] MakeOffsetTable(byte[] needle)
    {
        int[] table = new int[needle.Length];
        int lastPrefixPosition = needle.Length;
        for (int i = needle.Length - 1; i >= 0; --i)
        {
            if (IsPrefix(needle, i + 1))
            {
                lastPrefixPosition = i + 1;
            }

            table[needle.Length - 1 - i] = lastPrefixPosition - i + needle.Length - 1;
        }

        for (int i = 0; i < needle.Length - 1; ++i)
        {
            int slen = SuffixLength(needle, i);
            table[slen] = needle.Length - 1 - i + slen;
        }

        return table;
    }

    private static bool IsPrefix(byte[] needle, int p)
    {
        for (int i = p, j = 0; i < needle.Length; ++i, ++j)
        {
            if (needle[i] != needle[j])
            {
                return false;
            }
        }

        return true;
    }

    private static int SuffixLength(byte[] needle, int p)
    {
        int len = 0;
        for (int i = p, j = needle.Length - 1; i >= 0 && needle[i] == needle[j]; --i, --j)
        {
            len += 1;
        }

        return len;
    }
}

The algorithm spends a linear bit of time at the start, building its tables; from then it's blazingly fast.

Sign up to request clarification or add additional context in comments.

Comments

3
static int search(byte[] haystack, byte[] needle)
{
    for (int i = 0; i <= haystack.Length - needle.Length; i++)
    {
        if (match(haystack, needle, i))
        {
            return i;
        }
    }
    return -1;
}

static bool match(byte[] haystack, byte[] needle, int start)
{
    if (needle.Length + start > haystack.Length)
    {
        return false;
    }
    else
    {
        for (int i = 0; i < needle.Length; i++)
        {
            if (needle[i] != haystack[i + start])
            {
                return false;
            }
        }
        return true;
    }
}

Comments

3

Use this:

bool isSubset = !t2.Except(t1).Any();

It's from @Farhad Jabiyev's Link: Check whether an array is a subset of another

2 Comments

The OP isn't treating the arrays as sets but as sequences.
It does not solve a problem. It does not locate small array in big array in that order which bytes was defined in small array.
0

If you've got millions of elements in bytes, I'd suggest:

  1. sort bytes
  2. foreach in small if cannot find element return false

Thus

bytes.Sort(); // only need to do this once.
bool smallContained = ContainsAll(bytes, small);

and

static bool ContainsAll(int[] src, int [] subset)
{ 
    foreach(var i in subset)
       if (src.BinarySearch(i) < 0)
               return false;
    return true;
}

1 Comment

I need find bytes in that order which they was defined in small array.
0

If I understand correctly you want to say if small is a subsequence of bytes. You can find it with a loop over bytes. It should run very fast thanks to processor caching.

  for (int i = 0, index = 0; i < bytes.Length; ++i) 
    if (bytes[i] == small[index]) {
      if (++index >= small.Length) {
        return true; 
      }
    }
  return false;

3 Comments

@PetterHesselberg Do you mean it doesn't correctly detect subsequences or that it doesn't address the confusing question at the top?
It doesn't even compile as-is; the code contains one opening brace and two closing braces. But that's a quibble, easily fixed. The problem is that it finds false positives -- if the bytes of small exist in bytes, but with other values in-between, your algorithm still returns true. I do believe the OP wants a consecutive byte sequence.
@PetterHesselberg Thanks for finding the typo. Those are not false positives, that's the definition of a subsequence. They OP said "First, I need find bytes in that order which they was defined in small array" and I understand that as subsequence. The code above delivers that. I can't understand the second one. Hopefully the OP comes back and clarifies.
0

You can use this function, from this Reddit post:

public static bool CheckSequence<T>(IEnumerable<T> containingArray, IEnumerable<T> containedArray)
{
    bool result = false;
    for (int i = 0; i <= containingArray.Count(); i++)
    {
        if (containedArray.SequenceEqual(containingArray.Skip(i).Take(containedArray.Count())))
            result = true;
    }
    return result;
}

Like:

var result = CheckSequence(bytes, small);

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.