EDIT: I've come up with a solution; here it is for anyone else who may want it. It may be updated in the future if a bug is found or other improvements are added. Last updated on 7/18/2015.
// Requires: using System; using System.IO; using System.Text;

/// <summary>
/// Decodes a string from the specified bytes in the specified encoding.
/// </summary>
/// <param name="Length">Specify -1 to read until null; otherwise, specify the number of bytes that make up the string.</param>
public static string GetString(byte[] Source, int Offset, int Length, Encoding Encoding)
{
    if (Length == 0) return string.Empty;
    var sb = new StringBuilder();
    if (Length <= -1)
    {
        // Let the StreamReader decode one character at a time, so variable-length
        // encodings such as UTF-8 are handled by the decoder rather than by hand.
        using (var sr = new StreamReader(new MemoryStream(Source, Offset, Source.Length - Offset), Encoding, false))
        {
            int ch;
            while (true)
            {
                ch = sr.Read();
                // 0 is the null terminator; -1 means the stream ended first.
                if (ch <= 0) break;
                sb.Append((char)ch);
            }
            if (ch == -1) throw new Exception("End of stream reached; null terminator not found.");
            return sb.ToString();
        }
    }
    else return Encoding.GetString(Source, Offset, Length);
}
I am upgrading my application's internal string/Encoding code and I've run into a little implementation issue.
Basically, I wanted to write a simple method, ReadNullTerminatedString. It wasn't too hard at first: I used Encoding.IsSingleByte to determine the length of a single character, read that many bytes, checked whether they were all 0, and stopped or continued reading based on the result.
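For reference, that first attempt might have looked roughly like the sketch below. The class name FixedWidthScan and the helper FindTerminator are hypothetical, and the sketch assumes that any encoding which is not single-byte uses 2-byte code units, which is exactly the assumption that turns out to be too narrow.

```csharp
using System;
using System.Text;

public static class FixedWidthScan
{
    // Hypothetical helper: returns the byte index of the first null terminator
    // at or after offset, or -1 if none is found. Assumes the encoding uses
    // fixed-size code units: 1 byte if IsSingleByte, otherwise 2 bytes.
    public static int FindTerminator(byte[] source, int offset, Encoding encoding)
    {
        int width = encoding.IsSingleByte ? 1 : 2;
        for (int i = offset; i + width <= source.Length; i += width)
        {
            bool allZero = true;
            for (int j = 0; j < width; j++)
            {
                if (source[i + j] != 0) { allZero = false; break; }
            }
            if (allZero) return i;
        }
        return -1;
    }

    // Decodes the bytes from offset up to (not including) the terminator.
    public static string ReadNullTerminatedString(byte[] source, int offset, Encoding encoding)
    {
        int end = FindTerminator(source, offset, encoding);
        if (end < 0) throw new ArgumentException("Null terminator not found.", "source");
        return encoding.GetString(source, offset, end - offset);
    }
}
```

With Encoding.ASCII or Encoding.Unicode (UTF-16) this stops at the first terminator and ignores whatever follows it.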
This is where it gets tricky. UTF-8 is a variable-length encoding. Encoding.IsSingleByte returns false for it, but that does not mean every character takes more than one byte: a UTF-8 character can be anywhere from 1 to 4 bytes long, so my implementation based on Encoding.IsSingleByte wouldn't work for UTF-8.
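One possible way around this is to stop asking how wide a character is and ask only how wide the terminator is, using encoding.GetByteCount("\0"): 1 byte for UTF-8, 2 for UTF-16, 4 for UTF-32. In UTF-8 the byte 0x00 only ever encodes U+0000, because every byte of a multi-byte sequence has its high bit set, so scanning for a single zero byte is safe there. The sketch below is a minimal illustration with a hypothetical class name (TerminatorWidthScan); it assumes the terminator is aligned to a multiple of its width from the offset, and it has not been checked against legacy multi-byte code pages.

```csharp
using System;
using System.Text;

public static class TerminatorWidthScan
{
    // Hypothetical helper, not a framework API. Scans for a run of zero bytes
    // as wide as the encoded null character, then decodes what precedes it.
    public static string ReadNullTerminatedString(byte[] source, int offset, Encoding encoding)
    {
        // Width of an encoded '\0': 1 for UTF-8/ASCII, 2 for UTF-16, 4 for UTF-32.
        int width = encoding.GetByteCount("\0");
        for (int i = offset; i + width <= source.Length; i += width)
        {
            int j = 0;
            while (j < width && source[i + j] == 0) j++;
            if (j == width)
            {
                // Found the terminator: decode only the bytes before it.
                return encoding.GetString(source, offset, i - offset);
            }
        }
        throw new ArgumentException("Null terminator not found.", "source");
    }
}
```

For example, Encoding.UTF8.GetByteCount reports 1 byte for "a", 2 for "\u00e9" and 3 for "\u20ac", yet the terminator is always a single zero byte, so the scan above never needs to know the width of any individual character.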
At that point I wasn't sure whether that method could be corrected, so I had another idea: just call the encoding's GetString method on the bytes, pass the maximum length the string can be as the count parameter, and then trim the zeros off the returned string.
That too has a caveat. My managed applications will be interacting with byte arrays returned from unmanaged code, where there will of course be a null terminator, but possibly also junk characters after it. For example: "blah\0\0oldstring"
ReadNullTerminatedString would be the ideal solution in that case, but at the moment it can't be if I want it to support UTF-8. The second solution will not work either: it will trim the 0s, but the junk will remain.
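A variation on the second idea that might cope with trailing junk is to cut the decoded string at the first '\0' instead of trimming zeros from the end. The sketch below is only an illustration; DecodeThenCut and GetStringUpToNull are hypothetical names, and it assumes the encoding is configured to substitute a replacement character for invalid bytes rather than throw, since the junk after the terminator is decoded too before being discarded.

```csharp
using System;
using System.Text;

public static class DecodeThenCut
{
    // Hypothetical helper: decodes maxLength bytes starting at offset, then
    // keeps only the characters before the first null character, if any.
    public static string GetStringUpToNull(byte[] source, int offset, int maxLength, Encoding encoding)
    {
        string decoded = encoding.GetString(source, offset, maxLength);
        int terminator = decoded.IndexOf('\0');
        return terminator >= 0 ? decoded.Substring(0, terminator) : decoded;
    }
}
```

The cost is that the whole buffer is decoded even when the terminator sits near the start, so for large buffers a scan for the terminator before decoding may be preferable.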
Any ideas for an elegant solution for C#?