Reading a "string in little-endian UTF-16 encoding" with BinaryReader

Question

I am following the specification for this file format: https://github.com/rouault/dump_gdbtable/wiki/FGDB-Spec

utf16: string in little-endian UTF-16 encoding

How do I read this? I tried BinaryReader.ReadString() however it returns something along the lines of:

"\0e\0y\0w\0o\0r\0d\0\0 \0\0\0\0\rP\0a\0r\0a\0m\0e\0t\0e\0r\0N\0a\0m\0e\0\0 \0\0\0\0\fC\0o\0n\0f\0i\0g\0S\0t\0r\0"

That definitely isn't right.

From the specification:

ubyte: number of UTF-16 characters (not bytes) of the name of the field
utf16: name of the field
ubyte: number of UTF-16 characters (not bytes) of the alias of the field. Might be 0
utf16: alias of the field (omitted if previous field is 0)
ubyte: field type ( 0 = int16, 1 = int32, 2 = float32, 3 = float64, 4 = string, 5 = datetime, 6 = objectid, 7 = geometry, 8 = binary, 9=raster, 10/11 = UUID, 12 = XML )

Could I somehow use the number of UTF-16 characters to read the name of the field?

How do you construct the BinaryReader? Are you using an overload where you specify the encoding of the text?
– Damien_The_Unbeliever, Commented Aug 1, 2014 at 14:20
Normally you specify the encoding, but on this page there is no little-endian UTF-16. Perhaps you have to write your own encoding somehow (or one of them is what you need, not sure).
– Sinatr, Commented Aug 1, 2014 at 14:23
BinaryReader br = new BinaryReader(File.Open("C:\\florida.gdb\\a00000002.gdbtable", FileMode.Open, FileAccess.Read, FileShare.Read | FileShare.Delete));
– Evan Parsons, Commented Aug 1, 2014 at 14:25
@Sinatr - there is such an encoding. It helps to know that in the Windows world, Unicode means UTF-16.
– Damien_The_Unbeliever, Commented Aug 1, 2014 at 14:28

ulrichb · Accepted Answer · 2014-08-01 14:46:47Z

3

BinaryReader's ReadString() method has no overload that lets you specify the string length. Instead, it expects its own length prefix (the byte count, stored as a 7-bit-encoded integer), which doesn't match the format in the spec you linked.
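
To illustrate the mismatch, here is a small sketch of what BinaryWriter emits for a string, which is the layout ReadString() expects. The class and method names (PrefixDemo, WriteWithBinaryWriter) are made up for this illustration and are not part of the spec or of .NET.

```csharp
using System.IO;
using System.Text;

static class PrefixDemo
{
    // Hypothetical helper: returns the raw bytes BinaryWriter produces
    // when it writes a string using UTF-16LE (Encoding.Unicode).
    public static byte[] WriteWithBinaryWriter(string text)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new BinaryWriter(stream, Encoding.Unicode))
            {
                // Writes a 7-bit-encoded BYTE count, then the encoded bytes.
                writer.Write(text);
            }
            // ToArray() still works after the stream has been closed.
            return stream.ToArray();
        }
    }
}
```

For "Keyword" the prefix byte is 14 (the number of bytes), whereas the file format in the question stores 7 (the number of UTF-16 characters). ReadString() would therefore consume only half of the name and leave the reader misaligned by the rest.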

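Therefore, you cannot use ReadString() directly, but you can:

use ReadByte() to get the string length in characters,
multiply it by 2 to get the length in bytes (each UTF-16 code unit is 2 bytes),
use ReadBytes(count) to read that many bytes,
use Encoding.Unicode.GetString(bytes) to decode them.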
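
The steps above can be sketched as follows. This is a minimal sketch, assuming the reader is already positioned at the ubyte length; the names Utf16Reading and ReadUtf16WithCharCount are hypothetical and chosen only for this example.

```csharp
using System.IO;
using System.Text;

static class Utf16Reading
{
    // Hypothetical helper (not a BinaryReader method): reads a ubyte
    // character count followed by that many UTF-16LE code units.
    public static string ReadUtf16WithCharCount(BinaryReader reader)
    {
        int charCount = reader.ReadByte();   // count of UTF-16 characters
        int byteCount = charCount * 2;       // 2 bytes per UTF-16 code unit
        byte[] bytes = reader.ReadBytes(byteCount);

        // ReadBytes returns fewer bytes if the stream ends early.
        if (bytes.Length != byteCount)
            throw new EndOfStreamException("String data is truncated.");

        // Encoding.Unicode is little-endian UTF-16 in .NET.
        return Encoding.Unicode.GetString(bytes);
    }
}
```

A count of 0 simply yields an empty string, which matches the spec's note that the alias is omitted when its length is 0. Calling the helper twice in a row would then read the field name followed by the alias.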

edited Aug 1, 2014 at 14:46

answered Aug 1, 2014 at 14:37

ulrichb


3 Comments

Evan Parsons Over a year ago

Is multiplying by two necessary? When I do it, it returns something similar to the answer below, except with more Chinese/Japanese characters after it. Code sample: int count = br.ReadByte() * 2; byte[] array = br.ReadBytes(count); field.nameOfField = Encoding.Unicode.GetString(array);

CSharpie Over a year ago

The spec says number of characters, not bytes. Since Encoding.Unicode is 16 bits (2 bytes per char), you want to multiply by 2. You might want to add code to your question showing how you try to read the string.

Evan Parsons Over a year ago

aha! I think that's it! It returns "Keyword" which I believe is the name of the field.

Damien_The_Unbeliever · Answer · 2014-08-01 14:29:14Z

1

It should be:

BinaryReader br = new BinaryReader(File.Open("C:\\florida.gdb\\a00000002.gdbtable",
                                   FileMode.Open,
                                   FileAccess.Read,
                                   FileShare.Read | FileShare.Delete),
                      Encoding.Unicode);

Where Encoding is System.Text.Encoding.

For various historical reasons, Microsoft/Windows refer to UTF-16 (and, specifically, the little-endian variant) as "Unicode" rather than UTF-16.
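
As a quick check of the naming, the sketch below compares the bytes produced by Encoding.Unicode and Encoding.BigEndianUnicode. The EncodingDemo class and its methods are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hex dump of the string encoded as little-endian UTF-16.
    public static string LittleEndianHex(string s)
    {
        return BitConverter.ToString(Encoding.Unicode.GetBytes(s));
    }

    // Hex dump of the same string encoded as big-endian UTF-16.
    public static string BigEndianHex(string s)
    {
        return BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s));
    }
}
```

For the single character "K" (U+004B), LittleEndianHex gives "4B-00" and BigEndianHex gives "00-4B", so Encoding.Unicode is the one that matches a little-endian UTF-16 field.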

answered Aug 1, 2014 at 14:29

Damien_The_Unbeliever


3 Comments

Evan Parsons Over a year ago

It returns "攀礀眀漀爀搀\0 \0ЀഀParameterNameЀ \0䌌漀渀昀椀最匀琀爀" when I switch it to your coding. Would I have to strip out the other characters? I'd do that, but I'm afraid of losing them when I go to save it again.

Lasse V. Karlsen Over a year ago

If you get that in return something is almost certainly wrong.

CSharpie Over a year ago

The file format doesn't work like this! You have to read the bytes at the specific offset and then interpret them as Unicode.
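
A minimal sketch of that idea, assuming the offset and the character count are already known from the surrounding structure of the file. OffsetReading and ReadUtf16At are hypothetical names, and no real offsets from the format are implied.

```csharp
using System.IO;
using System.Text;

static class OffsetReading
{
    // Hypothetical helper: seeks to an absolute offset and decodes
    // charCount UTF-16LE code units starting there.
    public static string ReadUtf16At(BinaryReader reader, long offset, int charCount)
    {
        reader.BaseStream.Seek(offset, SeekOrigin.Begin);

        int byteCount = charCount * 2;       // 2 bytes per UTF-16 code unit
        byte[] bytes = reader.ReadBytes(byteCount);
        if (bytes.Length != byteCount)
            throw new EndOfStreamException("String data is truncated.");

        return Encoding.Unicode.GetString(bytes);
    }
}
```

Starting one byte too early or too late pairs up the wrong bytes, which is a likely explanation for output such as "攀礀眀漀爀搀": each character's bytes are shifted by one position.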
