3

This Perl binary regex found at http://www.w3.org/International/questions/qa-forms-utf-8.en.php matches UTF-8 documents without the UTF-8 BOM header:

$field =~
m/\A(
 [\x09\x0A\x0D\x20-\x7E]            # ASCII
 | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
 |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
 | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
 |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
 |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
 | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
 |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*\z/x;

I need this because I am working on a PowerShell equivalent to 'grep -I', and part of this involves detecting text encoding.

But how do I rewrite this in C# or PowerShell? Or in other words, in ".Net Regex" syntax?

EDIT: I found a question about this very regex, of all things, at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40. The short answer seems to be that this cannot be done in .Net, since .Net regular expressions operate on strings and do not support binary (byte-level) matching.

1
  • This is a very simple regex. Could you explain what specific problem you have converting this? Commented Jul 8, 2009 at 20:24

4 Answers

1

This post at http://social.msdn.microsoft.com/Forums/en-US/regexp/thread/6a81be63-e6da-4156-a5bf-8b9782a1ac40 describes several workarounds.


1

The odds are pretty good that if a sequence has no invalid UTF-8 characters, it can be treated as UTF-8. Since RegExps are for text in .Net, not byte arrays, here's a non-regexp solution that should work. Personally, I'd rather use this as a fallback mechanism (e.g. mycommand -autodetect) and offer pipeline parameters that allow user-specified encodings.

        // Requires: using System.Text;
        string result = String.Empty;
        // GetEncoding expects the registered name ("utf-8", i.e. WebName),
        // not the display name returned by EncodingName.
        Encoding ae = Encoding.GetEncoding(
              Encoding.UTF8.WebName,
              new EncoderExceptionFallback(),
              new DecoderExceptionFallback());
        try {
            // Throws DecoderFallbackException on any invalid UTF-8 byte sequence.
            result = ae.GetString(mybytes);
        }
        catch (DecoderFallbackException)
        {
            // Revert to some sensible default. Maybe the ANSI code page for this environment?
            // Encoding.Default uses the substitution fallback mechanism, which usually replaces unknown characters with question marks.
            result = Encoding.Default.GetString(mybytes);
        }

If you can interact with unmanaged code, research the MLANG dll that ships with IE. It has alternate encoding autodetection methods that may be more useful.


1

Try this: (I haven't checked that it matches correctly; you can easily try it in LINQPad).

new Regex(@"
    ^(
    [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$", RegexOptions.IgnorePatternWhitespace)

EDIT:

Try reading your file using an ASCII StreamReader; that should do what you're looking for. (Note that I didn't actually try it)
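
For what it's worth, here is a minimal sketch of how the pattern above might be applied to raw bytes. It decodes with ISO-8859-1 rather than ASCII, because .Net's ASCII decoder replaces every byte above 0x7F with '?', whereas ISO-8859-1 maps each byte to the character with the same numeric value. The class and method names (`Utf8Pattern`, `LooksLikeUtf8`) are made up for illustration, and I switched the anchors to `\A` and `\z` to mirror the Perl original.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class Utf8Pattern
{
    // Same alternatives as the W3C pattern; each \xNN matches the char U+00NN,
    // which stands in for the byte 0xNN after ISO-8859-1 decoding.
    private static readonly Regex Utf8 = new Regex(@"
        \A(?:
           [\x09\x0A\x0D\x20-\x7E]            # ASCII
         | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
         |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
         | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
         |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
         |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
         | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
         |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
        )*\z",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.CultureInvariant);

    // Hypothetical helper: true if the bytes match the pattern end to end.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        // One char per byte, same numeric value.
        string oneCharPerByte = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        return Utf8.IsMatch(oneCharPerByte);
    }
}
```

Usage would be something like `Utf8Pattern.LooksLikeUtf8(File.ReadAllBytes(path))`. Note that this builds a string as long as the file, so for large files you would probably want to check only a leading chunk.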

2 Comments

The Perl regex is a binary regex, so this will not work. After more research, it doesn't seem that .Net supports binary regular expressions.
You can fake "binary" regex matching by decoding the byte stream in such a way that each byte is converted to a character with the same numeric value. Just use ISO-8859-1.
0

What specifically are you trying to do?

You should be able to use the System.Text.Encoding class.

2 Comments

I don't see how to detect the encoding of a binary stream using this class. The regular expression in the question matches only if the binary stream is valid UTF-8.
kervin: You can try parsing the stream as UTF-8. If it fails, then it wasn't UTF-8, otherwise it was.
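
A rough sketch of that "parse it and see" idea, assuming the whole input is already in a byte array. `StrictUtf8` and `IsValidUtf8` are hypothetical names; `UTF8Encoding` with `throwOnInvalidBytes: true` is the standard way to make the decoder throw instead of substituting replacement characters.

```csharp
using System.Text;

public static class StrictUtf8
{
    // No BOM on encode; throw on invalid byte sequences when decoding.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Hypothetical helper: true if the bytes decode cleanly as UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            Strict.GetString(bytes);   // result discarded; we only care whether it throws
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;              // at least one invalid or truncated sequence
        }
    }
}
```

One difference from the regex in the question: the decoder accepts control characters such as NUL, which the pattern rejects. For a 'grep -I' style binary check you would likely want a separate test for those.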
