I have a byte array with a series of characters. In one case I have:

[28] = 0x6e
[29] = 0x61
[30] = 0x6d
[31] = 0x65
[32] = 0x00
[33] = 0x00
[34] = 0x00
[35] = 0x4f
[36] = 0x08
[37] = 0x00
[38] = 0x07
[39] = 0x00
[40] = 0x00
[41] = 0x04
[42] = 0x13
[43] = 0xff
[44] = 0xff
[45] = 0x00
[46] = 0x00

Elements 28 to 31 hold the characters "name", with that section ending at element 32. Then I have another byte array:

[47] = 0x01
[48] = 0x03
[49] = 0x00
[50] = 0x00
[51] = 0x73
[52] = 0x65
[53] = 0xc3
[54] = 0xb1
[55] = 0x6f
[56] = 0x72
[57] = 0x00
[58] = 0x00
[59] = 0x00
[60] = 0x4f
[61] = 0x08
[62] = 0x00
[63] = 0x08
[64] = 0x00
[65] = 0x00
[66] = 0x04
[67] = 0x13
[68] = 0xff
[69] = 0xff
[70] = 0x00
[71] = 0x00

where I believe I have the string señor.

With the first array it's easy to find the name: it is the first 4 bytes, with 00 as a terminator. But how do I decipher what's in the second byte array?

Both arrays are vector<char>s.

  • It seems like it is using UTF-8 encoding. Commented Dec 20, 2016 at 17:58
  • What Unicode? UTF16, UTF8? BTW C++11 and later have utf16 literals so you don't need to parse anything. If bytes 28-31 are name, you probably use UTF8. In any case, a UTF8 string in C++ is a std::string Commented Dec 20, 2016 at 18:00
  • @PanagiotisKanavos - how do I know where the name ends in the 2nd byte array? Commented Dec 20, 2016 at 18:01
  • Check C++ String and Character Literals for the current state of Unicode support. What do you want to do with this array? Do you want to display it? Work with individual characters? It may be easier to convert it to a char16_t/u16string Commented Dec 20, 2016 at 18:03
  • If you don't know in advance which encoding is used (UTF-8, UTF-16, UTF-32, or a Windows code page), the only option is trial and error. You should at least learn the differences between UTF-8, UTF-16, and code pages. Commented Dec 20, 2016 at 18:05

1 Answer

The text is obviously using UTF-8 encoding:

[53] = 0xc3
[54] = 0xb1

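These two bytes are the UTF-8 encoding of the ñ character (U+00F1), and the surrounding bytes are the remaining four characters of señor. The string therefore occupies six bytes (elements 51 to 56) even though it has only five characters.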
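To see why 0xc3 0xb1 is ñ, the two bytes can be decoded by hand. The following is a minimal sketch that handles only the two-byte form of UTF-8 (lead byte 110xxxxx, continuation byte 10yyyyyy); the function name decode_two_byte_utf8 is made up for this illustration and is not part of any library.

```cpp
#include <cstdint>

// Hypothetical helper: decode a two-byte UTF-8 sequence into a code point.
// Layout: lead byte 110xxxxx, continuation byte 10yyyyyy.
// Returns 0 if the bytes are not a valid two-byte sequence.
std::uint32_t decode_two_byte_utf8(unsigned char lead, unsigned char cont) {
    if ((lead & 0xE0) != 0xC0 || (cont & 0xC0) != 0x80) {
        return 0;  // wrong bit pattern for a two-byte sequence
    }
    // Keep the low 5 bits of the lead byte and the low 6 bits of the continuation.
    std::uint32_t cp = ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
    // Values below 0x80 must be encoded as a single byte (overlong form otherwise).
    return cp >= 0x80 ? cp : 0;
}
```

For the bytes in the question, decode_two_byte_utf8(0xc3, 0xb1) gives 0xF1, which is the code point of ñ. A full decoder would also need the one-, three- and four-byte forms.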

The C++ library does have some facilities for working with UTF-8, but I have always found those library classes somewhat awkward and inflexible. On most platforms, you have the excellent, flexible iconv library, which offers a simple API for converting between UTF-8 and other encodings.
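As a rough illustration of the iconv route, here is a sketch that converts UTF-8 bytes to UTF-32 so that each element is one code point. It assumes a POSIX-style iconv (as in glibc) that accepts the encoding names "UTF-8" and "UTF-32LE", and a little-endian host so the output bytes can be read directly as char32_t. The wrapper name utf8_to_utf32 is invented for this example. Some platforms need an extra link flag such as -liconv.

```cpp
#include <iconv.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical wrapper around iconv: UTF-8 bytes in, UTF-32 code points out.
// Assumption: little-endian host, so "UTF-32LE" output matches char32_t layout.
std::u32string utf8_to_utf32(const std::string& in) {
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");  // (to, from)
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open failed");
    }

    // UTF-8 never yields more code points than it has bytes.
    std::u32string out(in.size(), U'\0');

    char* src = const_cast<char*>(in.data());
    std::size_t src_left = in.size();
    char* dst = reinterpret_cast<char*>(&out[0]);
    std::size_t dst_left = out.size() * sizeof(char32_t);

    std::size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (std::size_t)-1) {
        throw std::runtime_error("invalid or incomplete UTF-8 input");
    }

    // Drop the unused tail of the output buffer.
    out.resize(out.size() - dst_left / sizeof(char32_t));
    return out;
}
```

With the six bytes of señor as input, the result should hold five code points, the third being U+00F1.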

2 Comments

How do I know where the unicode string ends in the byte array?
By understanding where the strings come from, and what process creates them. There is no sign that individual bytes wear around their neck that say "this is where the string ends". In C-style strings, they get terminated by a '\0' byte. If these are C-style strings, that's how you find their end.
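If the strings do turn out to be C-style, finding the end is a search for the first 0x00 byte from a known starting position. The sketch below assumes exactly that: the data is NUL-terminated and the start offset is already known from the surrounding format. The helper name read_c_string is made up for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: copy the bytes from `start` up to, but not including,
// the first 0x00 byte. If no terminator is found, copy to the end of the buffer.
// Assumption: the data at `start` really is a NUL-terminated string.
std::string read_c_string(const std::vector<char>& buf, std::size_t start) {
    if (start >= buf.size()) {
        return std::string();
    }
    auto first = buf.begin() + static_cast<std::ptrdiff_t>(start);
    auto last = std::find(first, buf.end(), '\0');
    return std::string(first, last);
}
```

This works unchanged for UTF-8, because a 0x00 byte never occurs inside a multi-byte UTF-8 sequence. Called at the position of 0x73 in the second array, it would return the six bytes 73 65 c3 b1 6f 72, which is señor in UTF-8. What the bytes after the terminator mean cannot be told from the dump alone.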
