0

I have a byte array that I need to store into a nvarchar DB column. A nvarchar takes 2 bytes. What is the optimal encoding?

Ideally I would store N bytes into a nvarchar of lenght N/2, but there is invalid unicode sequences that worries me.

0

3 Answers 3

2

The most optimal solution would be to store binary in a binary column. So you mean the most optimal encoding within the constraints of this suboptimal scenario?

Just go for base64, it's safe.

If you can't control the input bytes, you're bound to running into encoding problems sooner or later.

Sign up to request clarification or add additional context in comments.

3 Comments

Yes, I have the stated constraints. Going base64 is safe, not space-optimal.
"Going base64 is safe, not space-optimal" - or, you can simply use another binary-to-text encoding. For instance, yEnc is more space efficient than base64. But there are other binary-to-text encodings available that you can play with
> But there are other binary-to-text encodings available that you can play with @RemyLebeau That is exactly my question. What other encodings? I didn't know about yEnc. If this was an answer I would accept it.
1

Usually Base64 is a good way, but you may use just Unicode code points.

Unicode codepoints go from 0 to 10FFFF, but you can encode easily and efficiently 2 bytes and an half into a Unicode code point. Depending on your requirements, you may shift all codepoints by 128, so that you have ASCII for boundaries (and you do not need to worry about byte 0, and still you have enough code points for the 20bit binary data (per code point). [Or maybe just escape 0 as 0x10000]

This is generic, for Unicode (so generic Unicode). If you know the encoding (e.g. UTF-8, you may choose different encoding).

1 Comment

I do something similar in my Base524288. But to make sure I only produce valid code points that play well with any surrounding text/code/protocol, I use the 8 unassigned planes 4-11. That gives me 19 bits per code point. All encoded output is displayed as "missing glyphs" in text.
0

Have a look at tables 3-6 and 3-7 in the Unicode spec (version 16):

Table 3-6. UTF-8 Bit Distribution
Scalar Value First Byte Second Byte Third Byte Fourth Byte
00000000 0xxxxxxx 0xxxxxxx
00000yyy yyxxxxxx 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences.

Table 3-7. Well-Formed UTF-8 Byte Sequences
Code Points First Byte Second Byte Third Byte Fourth Byte
U+0000..U+007F 00..7F
U+0080..U+07FF C2..DF 80..BF
U+0800..U+0FFF E0 A0..BF 80..BF
U+1000..U+CFFF E1..EC 80..BF 80..BF
U+D000..U+D7FF ED 80..9F 80..BF
U+E000..U+FFFF EE..EF 80..BF 80..BF
U+10000..U+3FFFF F0 90..BF 80..BF 80..BF
U+40000..U+FFFFF F1..F3 80..BF 80..BF 80..BF
U+100000..U+10FFFF F4 80..8F 80..BF 80..BF

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that position. Any byte value outside of the ranges listed is ill-formed.

In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.

As long as you stay within those limitations, I believe you should be fine. If you only use it to store binary data that won't be displayed or exchanged with other systems, you don't have to worry about noncharacters, control characters and just weird characters that can mess things up. They would still be valid if you happened to produce them.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.