1

I know that I can use regex to match substrings in a string, but is it possible to match some patterns in binary data using regex? If so then in what format should the binary data be - binary array, stream, or something else?

edit:

Well, to explain: I have binary data that shouldn't contain certain strings. The data itself is binary, so I need to detect these patterns in order to mark the data as invalid.
But I can't convert the binary data to a string, since the result would be invalid; maybe only to a char[] or something.

edit:

Now I am thinking of converting the binary data to a basic encoding (any hints on which is the most basic encoding available? Certainly not Unicode; ASCII, I think?) and then using regex.
But the question is: would I be able to convert any binary data to a string using this encoding, or will I encounter cases that are invalid and cause exceptions during the conversion?

2
  • from Wikipedia: regular expressions provide a concise and flexible means for matching strings of text. Commented Oct 9, 2010 at 19:15
  • 3
    @splash: Wikipedia is wrong. Fundamentally, regular expressions are not about text but about formal languages, a totally abstract concept. You can obviously apply them to natural language text, but also to a lot of other things, as long as they are regular enough. Commented Oct 9, 2010 at 20:04

5 Answers

3

The technical answer to your question is yes, since you could just treat the binary data as a string of a particular encoding, but I don't believe that's what you're asking.

If you're asking if there's a library designed to do pattern matching on an array of bytes, then the .NET regex system will not do this and there isn't such a library that I'm aware of.


14 Comments

I don't want to treat this data as a string. Is there any other method I can use to achieve this without using regex?
@Karim: Not that I am aware of, but there are plenty of tutorials and explanations online about writing your own implementation of regular expressions. I wouldn't imagine it would be incredibly difficult to adapt one of these to work on binary data rather than text.
@Karim: If you're going to use string manipulation, I would suggest ASCII, as it's a single-byte encoding (so you'll get one character for every byte). Using a variable-byte encoding can/will result in the length of the string not necessarily corresponding to the length of your byte array.
Are you sure? msdn.microsoft.com/en-us/library/system.text.asciiencoding.aspx says otherwise: "Since ASCIIEncoding supports only the Unicode character values between U+0000 and U+007F (...)"
@Joren: Interesting, I was always under the impression that ASCIIEncoding took advantage of the extended ASCII character set, but evidently not. To get one character per byte across the full 0x00-0xFF range, you'll probably need the Western European ISO standard (ISO-8859-1): instead of using Encoding.ASCII, use Encoding.GetEncoding(28591). The various Unicode encodings will not work for this technique, as the decoder can end up interpreting multiple bytes as a single character.
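
The Encoding.GetEncoding(28591) suggestion above can be sketched as follows. This is a minimal illustration written as a C# top-level program, not a definitive implementation; the sample bytes and the forbidden string "DROP" are made up for the example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// ISO-8859-1 (code page 28591) maps each byte 0x00-0xFF to the char with the
// same numeric value, so decoding never fails and string indices equal byte offsets.
Encoding latin1 = Encoding.GetEncoding(28591);

// Hypothetical sample: arbitrary bytes with the ASCII text "DROP" embedded at offset 3.
byte[] data = { 0x00, 0xFF, 0x80, 0x44, 0x52, 0x4F, 0x50, 0xC3, 0x28 };

string text = latin1.GetString(data);
Match m = Regex.Match(text, "DROP");

Console.WriteLine($"chars = {text.Length}, bytes = {data.Length}");
Console.WriteLine(m.Success
    ? $"forbidden pattern found at byte offset {m.Index}"
    : "no forbidden pattern found");
```

Because the decoded string has the same length as the byte array, m.Index can be used directly as an offset into data.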
2

Yes, it is possible, but why would you want to? You would need to encode the data as a string first, of course, but if you are going to that trouble, why not simply deserialize the data into a more sensible data structure?

Regular expressions are for matching strings only. If you have binary data, then you can be quite sure that a regex is the wrong solution to your problem.

1 Comment

Well, the binary data I have can contain strings, but mostly it's binary. I just need to detect some string patterns that will mark the data as invalid.
0

I haven't tried this, but I'll bet you could convert your binary data to a base64 string, then use a regex to find your search string - of course, you would have to encode your search string in base64 as well.

3 Comments

I don't think this will work, because the byte offsets will be different in the base64 string. I mean the data won't be byte (8-bit) aligned but 6-bit aligned instead.
I thought if you converted the search string as well it might work out - but now that you mention it, I can see that it would work maybe only 1 out of 3 times, when the relevant bytes just happen to line up right. Oh well, sounded good when I wrote it... :(
Well, thanks for your input. All ideas are good, since they can lead to other ideas even if the first one doesn't work. :)
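
The alignment problem discussed in these comments can be checked with a small experiment. The sketch below is a C# top-level program; FoundInBase64 is a hypothetical helper, and the 3-byte needle "BAD" is chosen so that it encodes to exactly four Base64 characters with no padding.

```csharp
using System;
using System.Text;

// Hypothetical helper: places the pattern at the given byte offset inside
// zero-filled data and reports whether the Base64 form of the pattern
// appears inside the Base64 form of the data.
static bool FoundInBase64(byte[] pattern, int byteOffset)
{
    byte[] data = new byte[byteOffset + pattern.Length];
    pattern.CopyTo(data, byteOffset);
    return Convert.ToBase64String(data).Contains(Convert.ToBase64String(pattern));
}

byte[] needle = Encoding.ASCII.GetBytes("BAD"); // encodes to "QkFE"

for (int i = 0; i < 4; i++)
{
    Console.WriteLine($"offset {i}: found = {FoundInBase64(needle, i)}");
}
```

With this sample, the needle is found only at offsets 0 and 3, where the bytes line up with a 3-byte Base64 group. A needle whose length is not a multiple of 3 is harder still, since its last Base64 characters also depend on the bytes that follow it.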
0

Perl is able to match against pure binary data, and I think this should be possible with most other regex flavors.

You can use the escape '\xNN' to search for a particular byte by its hexadecimal value. So even character classes like '[\x20-\xff]' are possible.
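
The same '\xNN' escapes are accepted by the .NET regex engine. The sketch below, a C# top-level program, assumes the bytes are first decoded with ISO-8859-1 (code page 28591) so that each byte becomes one char; the sample bytes are made up for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sample: two control bytes, a 4-byte marker, the ASCII text "hello", a NUL.
byte[] data = { 0x01, 0x02, 0xDE, 0xAD, 0xBE, 0xEF, 0x68, 0x65, 0x6C, 0x6C, 0x6F, 0x00 };

// Assumption: decode with ISO-8859-1 so that char values equal byte values.
string text = Encoding.GetEncoding(28591).GetString(data);

// Match an exact byte sequence by hexadecimal value.
Match marker = Regex.Match(text, @"\xDE\xAD\xBE\xEF");

// Match a run of at least four printable ASCII bytes using a character class.
Match printable = Regex.Match(text, @"[\x20-\x7E]{4,}");

Console.WriteLine($"marker at byte offset {marker.Index}");
Console.WriteLine($"printable run at byte offset {printable.Index}, length {printable.Length}");
```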


0

It's possible to match a byte[] against a pattern. Decoding the bytes with a single-byte encoding such as Latin-1 (ISO-8859-1) maps every byte to exactly one char, so the Index and Length of each Match are also valid offsets into the byte[]. A multi-byte encoding such as UTF-8 does not guarantee this. The extension below uses Index and Length to cut the matching sub-arrays out of the original byte[].

using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class RegexByteArrayMatcher
{
    // Latin-1 maps each byte 0x00-0xFF to one char, so string indices equal byte offsets.
    private static readonly Encoding ByteEncoding = Encoding.Latin1;

    // Returns the bytes of every match of the pattern in the input.
    public static List<byte[]> Matches(this byte[] input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions regexOptions = RegexOptions.None) =>
        Regex.Matches(ByteEncoding.GetString(input), pattern, regexOptions).Select(m => input.Skip(m.Index).Take(m.Length).ToArray()).ToList();
    // Same as above, with the pattern itself supplied as bytes.
    public static List<byte[]> Matches(this byte[] input, [StringSyntax(StringSyntaxAttribute.Regex)] byte[] pattern) => input.Matches(ByteEncoding.GetString(pattern));
    public static bool IsMatch(this byte[] input, [StringSyntax(StringSyntaxAttribute.Regex)] string pattern, RegexOptions regexOptions = RegexOptions.None) =>
        Regex.IsMatch(ByteEncoding.GetString(input), pattern, regexOptions);
    public static bool IsMatch(this byte[] input, [StringSyntax(StringSyntaxAttribute.Regex)] byte[] pattern) => input.IsMatch(ByteEncoding.GetString(pattern));
    // Overloads for a pre-built Regex instance.
    public static List<byte[]> Matches(this Regex regex, byte[] input) =>
        regex.Matches(ByteEncoding.GetString(input)).Select(m => input.Skip(m.Index).Take(m.Length).ToArray()).ToList();
    public static bool IsMatch(this Regex regex, byte[] input) => regex.IsMatch(ByteEncoding.GetString(input));
}
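
Whether a match's Index can be reused as an offset into the byte[] depends on the encoding used to decode it. The following sketch, a C# top-level program with made-up sample bytes, compares UTF-8 with ISO-8859-1 (code page 28591) on the same input.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical sample: 0xC3 0xA9 is a valid two-byte UTF-8 sequence,
// followed by the ASCII text "BAD" starting at byte offset 2.
byte[] data = { 0xC3, 0xA9, 0x42, 0x41, 0x44 };

// UTF-8 collapses the first two bytes into one char, shifting later indices.
int utf8Index = Regex.Match(Encoding.UTF8.GetString(data), "BAD").Index;

// ISO-8859-1 keeps one char per byte, so the index is the byte offset.
int latin1Index = Regex.Match(Encoding.GetEncoding(28591).GetString(data), "BAD").Index;

Console.WriteLine($"UTF-8 index: {utf8Index}, ISO-8859-1 index: {latin1Index}");
```

Here the UTF-8 index is 1 while the ISO-8859-1 index is 2, and only the latter is the true byte offset. This is why a single-byte encoding is the safer choice when slicing the original array.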
