Encoding binary into unicode

Question

I have a byte array that I need to store in an nvarchar DB column. Each nvarchar character takes 2 bytes. What is the optimal encoding?

Ideally I would store N bytes in an nvarchar of length N/2, but there are invalid Unicode sequences, and that worries me.
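
The worry about invalid sequences can be illustrated with a minimal sketch. It uses Python's strict UTF-16 decoder as a stand-in for "is this valid Unicode text"; how a particular database treats such values is not covered here, and `is_valid_utf16le` is a hypothetical helper name.

```python
def is_valid_utf16le(data: bytes) -> bool:
    """Return True if data decodes as well-formed little-endian UTF-16."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


# Ordinary text reinterpreted as bytes is well-formed.
print(is_valid_utf16le("hi".encode("utf-16-le")))

# Two arbitrary bytes can form a lone high surrogate (code unit 0xD800).
print(is_valid_utf16le(b"\x00\xd8"))

# A low surrogate (0xDC00) with no preceding high surrogate is also ill-formed.
print(is_valid_utf16le(b"\x00\xdc\x41\x00"))
```

So packing raw bytes two per character only works if every 16-bit pair happens to avoid unpaired surrogates, which arbitrary binary input cannot guarantee.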

CodeCaster · Accepted Answer · 2019-10-01 13:12:28Z

2

The optimal solution would be to store binary data in a binary column. So you mean the best encoding within the constraints of this suboptimal scenario?

Just go for base64, it's safe.

If you can't control the input bytes, you're bound to run into encoding problems sooner or later.
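
A minimal sketch of the base64 route, using only Python's standard `base64` module. The function names `to_nvarchar_text` and `from_nvarchar_text` are hypothetical, and the size figure assumes the column stores text as UTF-16 (2 bytes per ASCII character).

```python
import base64


def to_nvarchar_text(data: bytes) -> str:
    """Encode arbitrary bytes as base64 text (ASCII characters only)."""
    return base64.b64encode(data).decode("ascii")


def from_nvarchar_text(text: str) -> bytes:
    """Decode base64 text back to the original bytes."""
    return base64.b64decode(text, validate=True)


payload = bytes(range(256))            # every possible byte value once
text = to_nvarchar_text(payload)
stored_bytes = len(text.encode("utf-16-le"))  # storage cost as UTF-16

print(len(payload), len(text), stored_bytes)
```

The output is always safe text, but each base64 character carries only 6 bits while occupying 16 bits of storage, which is the space cost discussed in the comments.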

3 Comments

milan Over a year ago

Yes, I have the stated constraints. Going base64 is safe, not space-optimal.

Remy Lebeau Over a year ago

"Going base64 is safe, not space-optimal" - or, you can simply use another binary-to-text encoding. For instance, yEnc is more space efficient than base64. But there are other binary-to-text encodings available that you can play with

milan Over a year ago

"But there are other binary-to-text encodings available that you can play with" - @RemyLebeau, that is exactly my question. What other encodings? I didn't know about yEnc. If this were an answer I would accept it.

Giacomo Catenazzi · Accepted Answer · 2019-10-01 14:43:24Z

1

Usually Base64 is a good choice, but you could also use Unicode code points directly.

Unicode code points range from 0 to 10FFFF, so you can easily and efficiently encode two and a half bytes (20 bits) into one code point. Depending on your requirements, you may shift all code points by 128, so that ASCII stays free for boundaries and you do not need to worry about byte 0, while still leaving enough code points for 20 bits of binary data per code point. [Or maybe just escape 0 as 0x10000.]

This approach is generic: it works at the level of Unicode code points, independent of the encoding form. If you know the encoding (e.g. UTF-8), you may choose a different scheme.
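
One way to sketch the 20-bits-per-code-point idea, under assumptions that are not part of the answer above: instead of shifting by 128, this version offsets every value by 0x10000, so the output lands in U+10000..U+10FFFF (exactly 2^20 code points), which sidesteps both NUL and the surrogate range U+D800..U+DFFF. The names `encode20` and `decode20` and the trailing pad digit are hypothetical choices made for illustration.

```python
BASE = 0x10000  # assumption: map 20-bit values onto U+10000..U+10FFFF


def encode20(data: bytes) -> str:
    """Pack 5 bytes (40 bits) into 2 code points of 20 bits each."""
    pad = (-len(data)) % 5               # zero bytes needed to fill the last block
    padded = data + b"\x00" * pad
    out = []
    for i in range(0, len(padded), 5):
        block = int.from_bytes(padded[i:i + 5], "big")
        out.append(chr(BASE + (block >> 20)))       # upper 20 bits
        out.append(chr(BASE + (block & 0xFFFFF)))   # lower 20 bits
    out.append(str(pad))                 # final ASCII digit records the padding
    return "".join(out)


def decode20(text: str) -> bytes:
    """Reverse encode20: rebuild 5-byte blocks, then drop the padding."""
    pad = int(text[-1])
    body = text[:-1]
    out = bytearray()
    for i in range(0, len(body), 2):
        hi = ord(body[i]) - BASE
        lo = ord(body[i + 1]) - BASE
        out += ((hi << 20) | lo).to_bytes(5, "big")
    return bytes(out[:len(out) - pad])


sample = bytes(range(256))
encoded = encode20(sample)
print(decode20(encoded) == sample, len(encoded.encode("utf-16-le")))
```

In UTF-16, each supplementary code point is stored as a surrogate pair, so every 5 input bytes take 8 bytes of storage here, compared with roughly 13.3 bytes for base64 text stored as UTF-16.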

1 Comment

ximo Over a year ago

I do something similar in my Base524288. But to make sure I only produce valid code points that play well with any surrounding text/code/protocol, I use the 8 unassigned planes 4-11. That gives me 19 bits per code point. All encoded output is displayed as "missing glyphs" in text.

ximo · Accepted Answer · 2024-10-29 11:29:48Z

Have a look at tables 3-6 and 3-7 in the Unicode spec (version 16):

Table 3-6. UTF-8 Bit Distribution

Scalar Value	First Byte	Second Byte	Third Byte	Fourth Byte
00000000 0xxxxxxx	0xxxxxxx
00000yyy yyxxxxxx	110yyyyy	10xxxxxx
zzzzyyyy yyxxxxxx	1110zzzz	10yyyyyy	10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx	11110uuu	10uuzzzz	10yyyyyy	10xxxxxx

Table 3-6 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of Unicode scalar values corresponding to one-, two-, three-, and four-byte sequences.

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points	First Byte	Second Byte	Third Byte	Fourth Byte
U+0000..U+007F	00..7F
U+0080..U+07FF	C2..DF	80..BF
U+0800..U+0FFF	E0	A0..BF	80..BF
U+1000..U+CFFF	E1..EC	80..BF	80..BF
U+D000..U+D7FF	ED	80..9F	80..BF
U+E000..U+FFFF	EE..EF	80..BF	80..BF
U+10000..U+3FFFF	F0	90..BF	80..BF	80..BF
U+40000..U+FFFFF	F1..F3	80..BF	80..BF	80..BF
U+100000..U+10FFFF	F4	80..8F	80..BF	80..BF

Table 3-7 lists all of the byte sequences that are well-formed in UTF-8. A range of byte values such as A0..BF indicates that any byte from A0 to BF (inclusive) is well-formed in that position. Any byte value outside of the ranges listed is ill-formed.

In Table 3-7, cases where a trailing byte range is not 80..BF are shown in bold italic to draw attention to them. These exceptions to the general pattern occur only in the second byte of a sequence.

As long as you stay within those limits, I believe you should be fine. If you only use it to store binary data that won't be displayed or exchanged with other systems, you don't have to worry about noncharacters, control characters, or other odd characters that can cause trouble. They would still be valid if you happened to produce them.
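
Table 3-7 can be turned into a small checker. The following is a sketch rather than a reference implementation: `TABLE_3_7` and `is_well_formed_utf8` are hypothetical names, and the ranges are transcribed from the table quoted above.

```python
# Each row: (first-byte range, [allowed range for each trailing byte]).
TABLE_3_7 = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [(0x80, 0xBF)]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), (0x80, 0xBF)]),
    ((0xE1, 0xEC), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xED, 0xED), [(0x80, 0x9F), (0x80, 0xBF)]),
    ((0xEE, 0xEF), [(0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF0, 0xF0), [(0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF1, 0xF3), [(0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)]),
    ((0xF4, 0xF4), [(0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)]),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Check data against the byte ranges of Table 3-7."""
    i = 0
    while i < len(data):
        # Find the row whose first-byte range contains the current byte.
        for (lo, hi), trailing in TABLE_3_7:
            if lo <= data[i] <= hi:
                break
        else:
            return False  # byte cannot start any sequence (e.g. C0, C1, F5..FF)
        tail = data[i + 1:i + 1 + len(trailing)]
        if len(tail) != len(trailing):
            return False  # sequence is truncated
        if any(not (a <= b <= z) for b, (a, z) in zip(tail, trailing)):
            return False  # a trailing byte falls outside its allowed range
        i += 1 + len(trailing)
    return True


print(is_well_formed_utf8("héllo €😀".encode("utf-8")))
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # would encode surrogate U+D800
```

The narrowed second-byte ranges (E0, ED, F0, F4 rows) are what exclude overlong forms, surrogates, and values above U+10FFFF.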
