Parsing Unicode in a byte array

Question

I have a byte array with a series of characters. In one case I have

[28] = 0x6e
[29] = 0x61
[30] = 0x6d
[31] = 0x65
[32] = 0x00
[33] = 0x00
[34] = 0x00
[35] = 0x4f
[36] = 0x08
[37] = 0x00
[38] = 0x07
[39] = 0x00
[40] = 0x00
[41] = 0x04
[42] = 0x13
[43] = 0xff
[44] = 0xff
[45] = 0x00
[46] = 0x00

Elements 28 to 31 hold the characters "name", with that section ending at element 32. Then I have another byte array:

[47] = 0x01
[48] = 0x03
[49] = 0x00
[50] = 0x00
[51] = 0x73
[52] = 0x65
[53] = 0xc3
[54] = 0xb1
[55] = 0x6f
[56] = 0x72
[57] = 0x00
[58] = 0x00
[59] = 0x00
[60] = 0x4f
[61] = 0x08
[62] = 0x00
[63] = 0x08
[64] = 0x00
[65] = 0x00
[66] = 0x04
[67] = 0x13
[68] = 0xff
[69] = 0xff
[70] = 0x00
[71] = 0x00

where I believe I have the string señor.

With the first array it's easy to find the name: it is the first 4 bytes, with 00 as a terminator. But how do I decipher what's in the second byte array?

Both arrays are vector<char>s.

What Unicode? UTF-16, UTF-8? BTW, C++11 and later have UTF-16 literals, so you don't need to parse anything. If bytes 28-31 are "name", you are probably using UTF-8. In any case, a UTF-8 string in C++ is a std::string.
– Panagiotis Kanavos, Commented Dec 20, 2016 at 18:00
@PanagiotisKanavos - how do I know where the name ends in the 2nd byte array?
– ruipacheco, Commented Dec 20, 2016 at 18:01
Check "C++ String and Character Literals" for the current state of Unicode support. What do you want to do with this array? Do you want to display it? Work with individual characters? It may be easier to convert it to a char16_t/u16string.
– Panagiotis Kanavos, Commented Dec 20, 2016 at 18:03
If you don't know in advance which encoding is used (UTF-8, UTF-16, UTF-32, or a Windows code page), the only option is trial and error. You should learn at least the differences between UTF-8, UTF-16, and code pages.
– Ripi2, Commented Dec 20, 2016 at 18:05

Sam Varshavchik · Accepted Answer · 2016-12-20 18:20:22Z

The text is obviously using UTF-8 encoding:

[53] = 0xc3
[54] = 0xb1

This is the UTF-8 encoding of the ñ character (U+00F1). The surrounding bytes are the remaining four characters of señor: 0x73 0x65 ("se") before it and 0x6f 0x72 ("or") after it.
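
To make the 0xc3 0xb1 claim concrete, here is a minimal sketch of decoding one UTF-8 sequence by hand. The `Decoded` struct and `decode_utf8_at` function are hypothetical names invented for this illustration, not part of any library, and the decoder is deliberately simplified: it does not reject overlong forms or surrogate code points.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type: the decoded code point and how many bytes it used.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Hypothetical helper: decodes the single UTF-8 sequence starting at bytes[pos].
// Returns U+FFFD (replacement character) with length 1 on malformed input.
Decoded decode_utf8_at(const std::vector<char>& bytes, std::size_t pos)
{
    const Decoded invalid{0xFFFD, 1};
    if (pos >= bytes.size()) return invalid;

    // char may be signed, so go through unsigned char before masking.
    const auto lead = static_cast<unsigned char>(bytes[pos]);
    char32_t cp;
    std::size_t extra;  // number of continuation bytes expected
    if      (lead < 0x80)           { cp = lead;        extra = 0; }  // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }  // 11110xxx
    else return invalid;            // stray continuation byte or invalid lead

    if (bytes.size() - pos <= extra) return invalid;  // sequence is truncated

    for (std::size_t i = 1; i <= extra; ++i) {
        const auto b = static_cast<unsigned char>(bytes[pos + i]);
        if ((b & 0xC0) != 0x80) return invalid;       // must look like 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);                  // append 6 payload bits
    }
    return {cp, extra + 1};
}
```

For the bytes 0xc3 0xb1, the lead byte contributes the bits 00011 and the continuation byte contributes 110001, which together give 0xF1, the code point of ñ.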

The C++ library does have some facilities for working with UTF-8; but I always found those library classes somewhat awkward and inflexible. On most platforms, you have an excellent, flexible iconv library with a simple, easy API for converting between UTF-8 and other encodings.
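
As a rough illustration of the iconv route, the sketch below converts a UTF-8 std::string to UTF-32 code points. The function name `utf8_to_utf32` is made up for this example. It assumes a little-endian host (so that "UTF-32LE" output can be read directly as char32_t) and a glibc-style `iconv` prototype taking `char**` for the input buffer; some platforms declare it with `const char**` or need `-liconv` at link time.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: UTF-8 in, UTF-32 code points out, using POSIX iconv.
// Assumption: little-endian host, so UTF-32LE bytes map directly to char32_t.
std::u32string utf8_to_utf32(const std::string& utf8)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");  // arguments are (to, from)
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("iconv_open failed");

    // One input byte yields at most one code point, so this is large enough.
    std::u32string out(utf8.size(), U'\0');

    char* in_ptr = const_cast<char*>(utf8.data());  // iconv does not modify the input
    std::size_t in_left = utf8.size();
    char* out_ptr = reinterpret_cast<char*>(&out[0]);
    std::size_t out_left = out.size() * sizeof(char32_t);

    const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);
    if (rc == static_cast<std::size_t>(-1))
        throw std::runtime_error("iconv failed: invalid or truncated UTF-8");

    // Drop the unused tail of the output buffer.
    out.resize(out.size() - out_left / sizeof(char32_t));
    return out;
}
```

Once the text is in UTF-32, each element is one code point, so indexing individual characters such as ñ no longer requires tracking multi-byte sequences.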


2 Comments

ruipacheco Over a year ago

How do I know where the unicode string ends in the byte array?

Sam Varshavchik Over a year ago

By understanding where the strings come from, and what process creates them. There is no sign that individual bytes wear around their neck that say "this is where the string ends". In C-style strings, they get terminated by a '\0' byte. If these are C-style strings, that's how you find their end.
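
If the data really are C-style strings, finding the end can be sketched as a search for the next 0x00 byte. The helper name `read_c_string` is hypothetical, and the start offset has to come from knowledge of the surrounding format (in the question's second dump, the text appears to start at element 51, but that is only inferred from the bytes shown). This works for UTF-8 because a 0x00 byte never occurs inside a multi-byte sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copies the bytes from `start` up to, but not including,
// the next 0x00 byte. If no terminator is found, it copies to the end of the buffer.
std::string read_c_string(const std::vector<char>& buf, std::size_t start)
{
    if (start >= buf.size()) return std::string();

    const auto first = buf.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = std::find(first, buf.end(), '\0');
    return std::string(first, last);  // raw UTF-8 bytes, not yet decoded
}
```

The returned std::string holds 6 bytes for señor, since ñ occupies two of them; the byte count and the character count differ for any non-ASCII text.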
