
When writing a string to a binary file using C#'s BinaryWriter, the length (in bytes) is automatically prepended to the output. According to the MSDN documentation this length is an unsigned integer, yet it occupies a single byte. The example given there is a one-character string whose character encodes to two UTF-8 bytes, written as three bytes in total: 1 size byte and 2 bytes for the character. This is fine for strings up to length 255, and matches the behaviour I've observed.

However, if your string is longer than 255 bytes, the size prefix grows as necessary. As a simple example, consider building a 1024-character string:

string header = "ABCDEFGHIJKLMNOP";   // 16 characters
for (int ii = 0; ii < 63; ii++)
{
    header += "ABCDEFGHIJKLMNOP";     // 64 blocks of 16 = 1024 characters total
}
fileObject.Write(header);             // fileObject is a BinaryWriter

results in a 2-byte length prefix before the string. Creating a 2^17-character string results in a somewhat maddening 3-byte prefix.
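For what it's worth, this growth can be checked directly by writing to a MemoryStream and inspecting the raw bytes. A minimal sketch (the 1024 characters here are ASCII, so the character count equals the UTF-8 byte count):

using System;
using System.IO;
using System.Text;

class PrefixDemo
{
    static void Main()
    {
        string header = new string('A', 1024); // 1024 ASCII chars -> 1024 UTF-8 bytes

        using (var ms = new MemoryStream())
        using (var writer = new BinaryWriter(ms, Encoding.UTF8))
        {
            writer.Write(header);
            writer.Flush();
            // Everything beyond the 1024 string bytes is the length prefix.
            Console.WriteLine(ms.ToArray().Length - 1024); // prints 2
        }
    }
}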

The question, therefore, is: when reading, how do I know how many bytes to read to get the size of what follows? I wouldn't necessarily know the header size a priori. Ultimately, can I force the Write(string) method to always use a prefix of consistent size (say, 2 bytes)?

A possible workaround is to write my own Write(string) method, but I would like to avoid that for obvious reasons (similar questions here and here accept this as an answer). Another, more palatable workaround is to have the reader look for a specific character that starts the ASCII string information (maybe an unprintable character?), but that is not infallible. A final workaround (that I can think of) would be to force the string to be within the range of sizes for a particular number of size bytes; again, that is not ideal.

While forcing the size prefix to be consistent would be easiest, I have control over the reader, so any clever reader solutions are also welcome.

Comments

  • It uses a variable-length 7-bit encoding. A micro-optimization, very little reason to be mad about it. If you don't like it then consider Encoding.UTF8.GetBytes(), but don't forget to also serialize the length of the byte[] array so you can properly read it back. Don't use 7-bit encoding, hehe. Commented Nov 21, 2017 at 9:27
  • Are you SURE strings of length between 128 and 255 are actually storing the length as a single byte? Commented Nov 21, 2017 at 9:30
  • @MatthewWatson I'm sure that they aren't :) Commented Nov 21, 2017 at 9:30
  • @AndyK. In that encoding, each byte carries information about whether another byte follows (that's why it's 7-bit encoding: the last bit is used for that). So you read 1 byte, check that bit, and decide whether you need to read the next byte. That means you can always read the string length, even though that length is stored in a variable number of bytes (see the sketch after these comments). Commented Nov 21, 2017 at 9:36
  • @AndyK. Here's the reference source for Read7BitEncodedInt: referencesource.microsoft.com/#mscorlib/system/io/… Commented Nov 21, 2017 at 9:42
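Putting those comments together: a reader that only has the raw Stream can recover the length by mirroring the framework's decode loop. A minimal sketch follows; the type and method names are illustrative, and it assumes the data was written by BinaryWriter with its default UTF-8 encoding. Note that BinaryReader.ReadString already does exactly this, so a hand-rolled version only matters if you are not using BinaryReader on the reading side.

using System;
using System.IO;
using System.Text;

static class PrefixedStringReader
{
    // Reads one byte at a time; the low 7 bits carry data, the high bit
    // says whether another byte follows.
    static int Read7BitEncodedLength(Stream stream)
    {
        int value = 0, shift = 0;
        while (true)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
            if (shift >= 35) throw new FormatException("Too many bytes in 7-bit encoded int.");
        }
    }

    public static string ReadPrefixedString(Stream stream)
    {
        int byteCount = Read7BitEncodedLength(stream);
        byte[] buffer = new byte[byteCount];
        int read = 0;
        while (read < byteCount) // Stream.Read may return fewer bytes than requested
        {
            int n = stream.Read(buffer, read, byteCount - read);
            if (n <= 0) throw new EndOfStreamException();
            read += n;
        }
        return Encoding.UTF8.GetString(buffer);
    }
}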

2 Answers


BinaryWriter and BinaryReader aren't the only way of writing binary data; simply: they provide a convention that is shared between that specific reader and writer. No, you can't tell them to use another convention - unless of course you subclass both of them and override the ReadString and Write(string) methods completely.

If you want to use a different convention, then simply: don't use BinaryReader and BinaryWriter. It is pretty easy to talk to a Stream directly using any text Encoding you want to get hold of the bytes and the byte count. Then you can use whatever convention you want. If you only ever need to write strings up to 65k then sure: use fixed 2 bytes (unsigned short). You'll also need to decide which byte comes first, of course (the "endianness").
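For illustration, a minimal sketch of such a convention - a UTF-8 payload behind a fixed little-endian 2-byte length. The names are illustrative; any pair of choices works as long as reader and writer agree:

using System;
using System.IO;
using System.Text;

static class FixedPrefixConvention
{
    // Convention: [2-byte little-endian byte count][UTF-8 payload].
    public static void WriteString(Stream stream, string value)
    {
        byte[] payload = Encoding.UTF8.GetBytes(value);
        if (payload.Length > ushort.MaxValue)
            throw new ArgumentException("Too long for a 2-byte length prefix.");

        stream.WriteByte((byte)(payload.Length & 0xFF)); // low byte first (little-endian)
        stream.WriteByte((byte)(payload.Length >> 8));
        stream.Write(payload, 0, payload.Length);
    }

    public static string ReadString(Stream stream)
    {
        int lo = stream.ReadByte(), hi = stream.ReadByte();
        if (lo < 0 || hi < 0) throw new EndOfStreamException();
        int byteCount = lo | (hi << 8);

        byte[] payload = new byte[byteCount];
        int read = 0;
        while (read < byteCount) // Stream.Read may return fewer bytes than requested
        {
            int n = stream.Read(payload, read, byteCount - read);
            if (n <= 0) throw new EndOfStreamException();
            read += n;
        }
        return Encoding.UTF8.GetString(payload);
    }
}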

As for the size of BinaryWriter's own prefix: Write(string) is essentially doing:

int byteCount = this._encoding.GetByteCount(value);
this.Write7BitEncodedInt(byteCount);

with:

protected void Write7BitEncodedInt(int value)
{
    uint num = (uint) value;
    while (num >= 0x80)
    {
        // emit the low 7 bits with the high bit set: "more bytes follow"
        this.Write((byte) (num | 0x80));
        num = num >> 7;
    }
    // final byte has the high bit clear: "this is the last byte"
    this.Write((byte) num);
}

This type of length encoding is pretty common - it is the same idea as the "varint" that protobuf uses, for example (base-128, least significant group first, retaining bit order within 7-bit groups, 8th bit as continuation).
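To make the mechanics concrete: a byte count of 300 (binary 10 0101100) is written least significant group first as 0xAC (the low 7 bits, 0x2C, with the continuation bit set) followed by 0x02 (the remaining bits, continuation bit clear). Lengths 0-127 therefore take one prefix byte, 128-16383 take two, and 16384-2097151 take three - which matches the 2-byte prefix for the 1024-character string and the 3-byte prefix for the 2^17-character string observed in the question.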


3 Comments

Referring to them as a convention makes a lot of sense, and explains why you cannot change them to fit your needs without overriding the methods. Your comment above strengthens that point, and is a paradigm shift in how I think about them.
@AndyK. to be honest, it sounds like you should be dealing with Stream directly...
I am writing human-readable header information to a data file, and it was just so tempting to use a very simple write(string) method which, on the surface, did everything I wanted. I think you're right.

If you want to write the length yourself:

using (var bw = new BinaryWriter(fs))
{
    byte[] payload = Encoding.Unicode.GetBytes("Your string");
    bw.Write((ushort)payload.Length); // use a byte, a ushort... whatever width you pick
    bw.Write(payload);                // raw bytes, no automatic prefix
}
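The read side then has to mirror whatever width was chosen. A minimal sketch, assuming a ushort (2-byte) length was written and fs is the same kind of FileStream:

using (var br = new BinaryReader(fs))
{
    ushort length = br.ReadUInt16();       // must match the width the writer used
    byte[] payload = br.ReadBytes(length); // may return fewer bytes at end of stream
    string value = Encoding.Unicode.GetString(payload);
}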
