Encoding and null terminated strings

Question

EDIT: I've come up with a solution, here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.

    // Requires: using System; using System.IO; using System.Text;

    /// <summary>
    /// Decodes a string from the specified bytes in the specified encoding.
    /// </summary>
    /// <param name="Length">Specify -1 to read until null; otherwise, specify the number of bytes that make up the string.</param>
    public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
    {
        if (Length == 0) return string.Empty;

        // Fixed-length case: decode exactly Length bytes.
        if (Length > 0) return Encoding.GetString(Source, Offset, Length);

        // Null-terminated case: decode one char at a time until U+0000.
        var sb = new StringBuilder();
        using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
        {
            int ch;
            while ((ch = sr.Read()) > 0)
            {
                sb.Append((char)ch);
            }
            // Read() returns -1 at end of stream, so no terminator was found.
            if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
            return sb.ToString();
        }
    }

I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.

Basically, I wanted to make an easy method, ReadNullTerminatedString. It wasn't too hard to make at first. I used Encoding.IsSingleByte to determine a single character's length, would read the byte(s), check for 0s, and stop reading/continue based on the result.

This is where it gets tricky. UTF-8 is a variable-length encoding. Encoding.IsSingleByte returns false for it, but that says nothing about the length of any particular character: a UTF-8 character can still be a single byte, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF-8.
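As a small illustration of the point above, the sketch below prints what Encoding.IsSingleByte reports for UTF-8 next to the actual byte counts of a few characters. The class name Utf8Widths and the helper ByteCount are made up for this example; only Encoding.UTF8, IsSingleByte and GetByteCount are real framework members.

```csharp
using System;
using System.Text;

static class Utf8Widths
{
    // Hypothetical helper: number of bytes the string occupies in the given encoding.
    public static int ByteCount(string s, Encoding encoding)
    {
        return encoding.GetByteCount(s);
    }

    public static void Main()
    {
        // IsSingleByte only says "not every character is one byte".
        Console.WriteLine(Encoding.UTF8.IsSingleByte);             // False

        Console.WriteLine(ByteCount("A", Encoding.UTF8));          // 1
        Console.WriteLine(ByteCount("\u00E9", Encoding.UTF8));     // 2 (e with acute accent)
        Console.WriteLine(ByteCount("\u20AC", Encoding.UTF8));     // 3 (euro sign)
        Console.WriteLine(ByteCount("\U0001F600", Encoding.UTF8)); // 4 (outside the BMP)
    }
}
```

So a reader that assumes a fixed character width based on IsSingleByte has no way to know how many bytes to consume for the next UTF-8 character.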

At that point I wasn't sure if that method could be corrected, so I had another idea. Just use the encoding's GetString method on the bytes, use the maximum length the string can be for the count param, and then trim the zeros off the returned string.

That too has a caveat. I have to consider cases where my managed applications will be interacting with byte arrays returned from unmanaged code. There will be a null terminator, of course, but there may also be extra junk characters after it. For example: "blah\0\0\oldstring"

ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF-8. The second solution will not work either: it will trim the trailing 0s, but the junk will remain.
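To make the junk problem concrete, here is a rough sketch showing that trimming zeros leaves the stale data in place. It also shows one possible variation that is not part of the original attempt: decode the whole buffer and cut at the first U+0000 instead of trimming. NullCut and DecodeUntilNull are hypothetical names, and the sample buffer assumes the junk starts right after the two terminators.

```csharp
using System;
using System.Text;

static class NullCut
{
    // Hypothetical helper: decode everything, then keep only the text
    // before the first U+0000 (or the whole string if there is none).
    public static string DecodeUntilNull(byte[] source, Encoding encoding)
    {
        string decoded = encoding.GetString(source);
        int terminator = decoded.IndexOf('\0');
        return terminator >= 0 ? decoded.Substring(0, terminator) : decoded;
    }

    public static void Main()
    {
        byte[] buffer = Encoding.UTF8.GetBytes("blah\0\0oldstring");

        // TrimEnd only removes trailing nulls; nothing here is trailing.
        string trimmed = Encoding.UTF8.GetString(buffer).TrimEnd('\0');
        Console.WriteLine(trimmed.Length); // 15: nulls and junk are still there

        Console.WriteLine(DecodeUntilNull(buffer, Encoding.UTF8)); // blah
    }
}
```

The trade-off with this variation is that the entire buffer gets decoded, junk included, which is wasteful for large buffers and relies on the bytes before the terminator being valid in the chosen encoding.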

Any ideas for an elegant solution for C#?

Have you looked at this post: stackoverflow.com/questions/11713878/…
– David Tansey, Commented Jul 17, 2015 at 19:41
Hi, yes, but it doesn't really apply here much since my method is intended to support any encoding.
– Eaton, Commented Jul 17, 2015 at 19:45

Ian Boyd · Accepted Answer · 2016-01-15 01:40:01Z


Your best solution is to use an implementation of TextReader:

StreamReader if you're reading from a stream
StringReader if you're reading from a string

With this you can read your source stream of bytes, in whatever encoding you like, and each "character" will come back to you as an int:

int ch = reader.Read();

Internally the magic is done through the .NET Decoder class (which you get from your Encoding):

var decoder = Encoding.UTF8.GetDecoder();

The Decoder class needs a short array buffer. Fortunately, StreamReader knows how to keep the buffer filled and keep everything working.
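For the curious, here is a minimal sketch of what working with a Decoder directly might look like. It feeds the bytes in one at a time to show that the decoder holds on to an incomplete multi-byte sequence until the rest arrives. DecoderDemo and DecodeByteByByte are names invented for this example, and this is not a claim about how StreamReader is implemented internally.

```csharp
using System;
using System.Text;

static class DecoderDemo
{
    // Hypothetical helper: pushes bytes through a Decoder one at a time
    // and collects whatever chars it emits along the way.
    public static string DecodeByteByByte(byte[] bytes, Encoding encoding)
    {
        Decoder decoder = encoding.GetDecoder();
        var sb = new StringBuilder();

        // Worst-case number of chars a single input byte can produce.
        char[] chars = new char[encoding.GetMaxCharCount(1)];

        foreach (byte b in bytes)
        {
            // flush: false keeps a partial sequence buffered inside the decoder.
            int n = decoder.GetChars(new[] { b }, 0, 1, chars, 0, false);
            sb.Append(chars, 0, n);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // The accented letter is two bytes in UTF-8, so it spans two calls.
        byte[] bytes = Encoding.UTF8.GetBytes("h\u00E9llo");
        Console.WriteLine(DecodeByteByByte(bytes, Encoding.UTF8) == "h\u00E9llo"); // True
    }
}
```

Calling Encoding.GetString on each byte separately would not work here, because a stateless decode has nowhere to keep the first half of a split character.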

Pseudocode

Untried, untested, and only happens to look like C#:

// Requires: using System.IO; using System.Text;
string ReadNullTerminatedString(Stream stm, Encoding encoding)
{
   StringBuilder sb = new StringBuilder();

   // Not disposed on purpose: disposing the reader would also close stm.
   TextReader rdr = new StreamReader(stm, encoding);
   int ch = rdr.Read();
   while (ch > 0) // Read() returns -1 at end of stream; 0 is the null terminator
   {
      sb.Append((char)ch);
      ch = rdr.Read();
   }
   return sb.ToString();
}

Note: Any code released into public domain. No attribution required.
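One thing worth keeping in mind with this approach: StreamReader reads ahead into its own buffer, so after the first string has been read the underlying stream is usually positioned well past the terminator. The sketch below, which is an illustration rather than part of the answer's code, reads two consecutive strings by reusing a single reader instead of creating a new one per string. ReaderReuse and ReadNullTerminated are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderReuse
{
    // Hypothetical helper: reads chars from an existing reader
    // until U+0000 or the end of the input.
    public static string ReadNullTerminated(TextReader reader)
    {
        var sb = new StringBuilder();
        int ch;
        while ((ch = reader.Read()) > 0)
        {
            sb.Append((char)ch);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("first\0second\0");
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8))
        {
            Console.WriteLine(ReadNullTerminated(reader));      // first

            // The reader has already pulled the whole (small) stream into its buffer.
            Console.WriteLine(stream.Position == data.Length);  // True

            Console.WriteLine(ReadNullTerminated(reader));      // second
        }
    }
}
```

If each call created its own StreamReader over the same stream, the second call would start from wherever the first reader's buffering left the stream, not from the byte after the terminator.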

edited Jan 15, 2016 at 1:40

answered Jul 17, 2015 at 21:06

Ian Boyd


1 Comment

Eaton Over a year ago

Thank you, Ian. Forgot I could use StreamReader for this. I managed to come up with my ideal solution. I'll add it to my main post so others can reference it if needed.
