3

I have a binary file (i.e., it contains bytes with values between 0x00 and 0xFF). There are also ASCII strings in the file (e.g., "Hello World") that I want to find and edit using Regex. I then need to write out the edited file so that it's exactly the same as the old one but with my ASCII edits having been performed. How?

        byte[] inbytes = File.ReadAllBytes(wfile);
        string instring = utf8.GetString(inbytes);
        // use Regex to find/replace some text within instring
        byte[] outbytes = utf8.GetBytes(instring);
        File.WriteAllBytes(outfile, outbytes);

Even if I don't do any edits, the output file is different from the input file. What's going on, and how can I do what I want?


EDIT: Ok, I'm trying to use the offered suggestion and am having trouble understanding how to actually implement it. Here's my sample code:

        string infile = @"C:\temp\in.dat";
        string outfile = @"C:\temp\out.dat";
        Regex re = new Regex(@"H[a-z]+ W[a-z]+");  // looking for "Hello World"
        byte[] inbytes = File.ReadAllBytes(infile);
        string instring = new SoapHexBinary(inbytes).ToString();
        Match match = re.Match(instring);
        if (match.Success)
        {
            // do work on 'instring'
        }
        File.WriteAllBytes(outfile, SoapHexBinary.Parse(instring).Value);

Obviously, I know I'll not get a match doing it that way, but if I convert my Regex to a string (or whatever), then I can't use Match, etc. Any ideas? Thanks!

  • You haven't stated how the output varies from the input, but if there is binary data in it, I would imagine the output varies considerably. You can't convert a binary file to UTF-8 and expect the binary data to pass through unscathed. Commented Oct 18, 2012 at 19:29
  • Edit a binary file with regex? No, don't try to do it. Commented Oct 18, 2012 at 19:32
  • So all you're asking is how to read a binary file as text, change it, and save it to disk, right? Regex is irrelevant. Commented Oct 18, 2012 at 19:36
  • No idea how you'd make it happen in C#, but why couldn't you have a regular expression engine that acts on byte strings rather than character strings? Commented Oct 18, 2012 at 19:37
  • Justin: Regex is quite relevant because there are numerous ASCII strings that I want to find/replace. Much easier to use regular expressions. Commented Oct 18, 2012 at 19:56

3 Answers

2

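Not all byte sequences are valid UTF-8. When you decode arbitrary binary as UTF-8, .NET's default decoder replaces every byte it cannot interpret with the replacement character U+FFFD, and encoding the string again writes that character as the three bytes EF BF BD instead of the original bytes. That is why the output differs even when you make no edits. Basically, if the whole file is not encoded text, then interpreting it as encoded text will not yield sensible results.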
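
The effect is easy to reproduce on a few bytes. The following is a minimal sketch, not the asker's code; the class name `Utf8RoundTripDemo` and the sample bytes are made up for illustration, and it assumes the default `Encoding.UTF8` instance with its replacement fallback.

```csharp
using System;
using System.Text;

static class Utf8RoundTripDemo
{
    // Decode bytes as UTF-8 and encode the resulting string again.
    public static byte[] RoundTrip(byte[] input)
    {
        string decoded = Encoding.UTF8.GetString(input);
        return Encoding.UTF8.GetBytes(decoded);
    }

    static void Main()
    {
        // 0xFF and a lone 0x9A are not valid UTF-8; 0x48 0x69 is "Hi".
        byte[] original = { 0xFF, 0x48, 0x69, 0x9A };
        byte[] roundTripped = RoundTrip(original);

        Console.WriteLine(BitConverter.ToString(original));     // FF-48-69-9A
        Console.WriteLine(BitConverter.ToString(roundTripped)); // EF-BF-BD-48-69-EF-BF-BD
    }
}
```

Each invalid byte comes back as three bytes, so the file both changes and grows.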

2 Comments

Thanks. I suspect that "the whole file is not encoded text". So is there no way to read a binary file, edit some ascii strings, and write the new binary file without bytes getting mangled?
I'm sure there is. This is probably not the best way, but it might work to pick some 8-bit encoding that round-trips to arbitrary binary. It looks like Windows-1252 is not such an encoding.
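
One 8-bit encoding that does round-trip arbitrary bytes is ISO-8859-1 (Latin-1), because it maps each byte 0x00 to 0xFF to the code point with the same value. The sketch below assumes that approach; the class name `Latin1Patcher` is hypothetical, and the paths and pattern are the ones from the question.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class Latin1Patcher
{
    // ISO-8859-1 maps every byte value to exactly one char and back,
    // so decoding followed by encoding reproduces the input bytes.
    static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    public static byte[] Replace(byte[] input, string pattern, string replacement)
    {
        string text = Latin1.GetString(input);
        string edited = Regex.Replace(text, pattern, replacement);
        return Latin1.GetBytes(edited);
    }

    static void Main()
    {
        byte[] input = File.ReadAllBytes(@"C:\temp\in.dat");
        byte[] output = Replace(input, "H[a-z]+ W[a-z]+", "Hi there");
        File.WriteAllBytes(@"C:\temp\out.dat", output);
    }
}
```

Some caveats: the replacement text must contain only characters up to U+00FF, classes such as `\w` can also match decoded non-ASCII bytes (explicit ranges like `[a-z]` are safer), and a replacement of a different length shifts every later byte, which may break formats that store offsets or lengths.
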
1

An alternative to working on the binary data directly: convert it to a hex string, work on that (Regex can be used here), and then convert it back to bytes and save it.

byte[] buf = File.ReadAllBytes(file);
var str = new SoapHexBinary(buf).ToString();

//str=89504E470D0A1A0A0000000D49484452000000C8000000C808030000009A865EAC00000300504C544......
//Do your work

File.WriteAllBytes(file,SoapHexBinary.Parse(str).Value);

PS: SoapHexBinary is in the System.Runtime.Remoting.Metadata.W3cXsd2001 namespace.
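
A text pattern such as "H[a-z]+ W[a-z]+" cannot be applied to the hex string as is; it has to be rewritten in terms of hex byte values. The sketch below is one way that might look, not part of the original answer. It swaps in `BitConverter.ToString` for `SoapHexBinary` because its dash-separated output keeps matches aligned to byte boundaries; the class name `HexPatcher` and the hand-translated pattern are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

static class HexPatcher
{
    // One ASCII lowercase letter as a hex byte: 0x61-0x6F or 0x70-0x7A.
    const string Lower = "(?:6[1-9A-F]|7[0-9A])";

    // "H[a-z]+ W[a-z]+" rewritten for dash-separated hex such as "48-65-6C".
    static readonly Regex HelloWorld =
        new Regex("48(?:-" + Lower + ")+-20-57(?:-" + Lower + ")+");

    public static byte[] Replace(byte[] input, string asciiReplacement)
    {
        // BitConverter.ToString yields uppercase hex pairs joined by '-',
        // so "48" can only ever match a whole byte.
        string hex = BitConverter.ToString(input);
        string replacementHex =
            BitConverter.ToString(Encoding.ASCII.GetBytes(asciiReplacement));

        string edited = HelloWorld.Replace(hex, replacementHex);

        // Parse each hex pair back into a byte, skipping empty pieces.
        return edited
            .Split(new[] { '-' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(pair => Convert.ToByte(pair, 16))
            .ToArray();
    }
}
```

Usage would be along the lines of `File.WriteAllBytes(outfile, HexPatcher.Replace(File.ReadAllBytes(infile), "Hi there"));`. The hex string is about three times the size of the file, so this suits small files better than large ones.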

-1

I got it! Check out the code:

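        // Requires: using System; using System.Collections.Generic; using System.IO;
        //           using System.Text; using System.Text.RegularExpressions;
        string infile = @"C:\temp\in.dat";
        string outfile = @"C:\temp\out.dat";
        Regex re = new Regex(@"H[a-z]+ W[a-z]+");   // looking for "Hello World"
        string repl =  @"Hi there";

        Encoding ascii = Encoding.ASCII;
        byte[] inbytes = File.ReadAllBytes(infile);
        // The decoded string is used only to locate matches. ASCII decoding yields
        // one char per byte, so match indexes are also offsets into inbytes.
        string instr = ascii.GetString(inbytes);
        Match match = re.Match(instr);
        int beg = 0;
        bool replaced = false;
        List<byte> newbytes = new List<byte>();
        while (match.Success)
        {
            replaced = true;
            // copy the original bytes that precede this match
            for (int i = beg; i < match.Index; i++)
                newbytes.Add(inbytes[i]);
            // write the replacement text in place of the matched bytes
            foreach (char c in repl)
                newbytes.Add(Convert.ToByte(c));
            // copy the original bytes up to the next match (or the end of the file)
            Match nmatch = match.NextMatch();
            int end = (nmatch.Success) ? nmatch.Index : inbytes.Length;
            for (int i = match.Index + match.Length; i < end; i++)
                newbytes.Add(inbytes[i]);
            beg = end;
            match = nmatch;
        }
        if (replaced)
        {
            var newarr = newbytes.ToArray();
            File.WriteAllBytes(outfile, newarr);
        }
        else
        {
            File.WriteAllBytes(outfile, inbytes);
        }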

4 Comments

This wouldn't work. The output of this code will be 63,0,63. Encoding ascii = Encoding.ASCII; byte[] inbytes = new byte[] {255,0,255 }; string instr = ascii.GetString(inbytes); var outbytes = ascii.GetBytes(instr);
Barry, you may lose bytes > 127.
I appreciate the warning, and I can't explain it, but my file has lots of values > 128 (e.g., 81, 9A, AE).
I can't explain it, Barry. Binary data is binary; trying to convert it to a string with any encoding will result in the loss of some data. I don't think this is the correct way to go. But anyway, it is your choice.
