So you've stuck a series of bytes representing a UTF-16 encoded string into a std::string. Presumably you're doing something like deserializing bytes that represent UTF-16, and the API for retrieving the bytes to be deserialized specifies std::string. I don't think that's the best design, but you'll handle converting it to a wstring the same way you'd handle converting the bytes to a float or anything else; validate the byte buffer and then cast it:
#include <cassert>
#include <iterator>
#include <string>
char c[] = "\0a\0b\xd8\x3d\xdc\x7f"; // big-endian UTF-16: 'a', 'b', and the surrogate pair D83D DC7F
std::string buf(std::begin(c),std::end(c)-1); // -1 drops the implicit null terminator
assert(0==buf.size()%2);
std::wstring utf16(reinterpret_cast<wchar_t const *>(buf.data()),buf.size()/sizeof(wchar_t));
// also validate that each code unit is legal, and that there are no isolated surrogates
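For that last step, here's a minimal validation sketch (is_valid_utf16 is just an illustrative name, not a standard function): every code unit in the high-surrogate range D800-DBFF must be immediately followed by one in the low-surrogate range DC00-DFFF, and a low surrogate must never appear on its own:
#include <string>
bool is_valid_utf16(std::wstring const &ws) {
    for(size_t i=0;i<ws.size();++i) {
        unsigned cu = ws[i] & 0xFFFF;
        if(cu >= 0xD800 && cu <= 0xDBFF) {         // high surrogate
            if(i+1 >= ws.size())
                return false;                      // string ends mid-pair
            unsigned next = ws[++i] & 0xFFFF;
            if(next < 0xDC00 || next > 0xDFFF)
                return false;                      // not followed by a low surrogate
        } else if(cu >= 0xDC00 && cu <= 0xDFFF) {
            return false;                          // isolated low surrogate
        }
    }
    return true;
}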
Things to keep in mind:
- This cast assumes that wchar_t is 16 bits, whereas most platforms use a 32-bit wchar_t.
- To be useful your APIs will need to be able to treat wchar_t strings as UTF-16, either because that's the platform's specified encoding for wchar_t* or because the APIs simply follow that convention.
- This cast assumes that the data matches the machine's endianness; otherwise you'll have to byte-swap each UTF-16 code unit in the wstring (see the sketch after this list). Under the UTF-16 encoding scheme, if the initial bytes aren't 0xFF 0xFE or 0xFE 0xFF and no higher-level protocol says otherwise, the data is big-endian.
- std::begin(), std::end() and string::data() are C++11
- UTF-16 doesn't actually meet the C++ language's requirements for a wchar_t encoding, but some platforms use it regardless. This causes an issue with some standard APIs that are supposed to deal in codepoints but can't, simply because a wchar_t holding a single UTF-16 code unit cannot represent all of the platform's codepoints.
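Regarding the byte-order caveat above: if the serialized bytes turn out not to match the machine's byte order, a per-code-unit swap after the cast is enough. Here's a minimal sketch, assuming the same 16-bit wchar_t as the cast (swap_utf16_byte_order is just an illustrative name, not a standard function):
#include <string>
// Swap the two bytes of every UTF-16 code unit in place.
void swap_utf16_byte_order(std::wstring &ws) {
    for(wchar_t &cu : ws) {            // range-based for is C++11
        unsigned u = cu & 0xFFFF;
        cu = static_cast<wchar_t>(((u & 0xFF) << 8) | (u >> 8));
    }
}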
Here's an implementation that doesn't rely on platform-specific details and requires nothing more than that wchar_t be large enough to hold UTF-16 code units and that each char hold exactly 8 bits of a UTF-16 code unit. It doesn't actually validate the UTF-16 data, though.
#include <string>
#include <cassert>
#include <climits>
#include <limits>
#include <iterator>
#include <algorithm>
#include <iostream>
enum class endian {
    big, little, unknown
};
// Deserialize big-endian UTF-16: even indices hold the high byte of each code unit, odd indices the low byte.
std::wstring deserialize_utf16be(std::string const &s) {
    assert(0==s.size()%2);
    std::wstring ws;
    for(size_t i=0;i<s.size();++i)
        if(i%2)
            ws.back() = ws.back() | ((unsigned char)s[i] & 0xFF);
        else
            ws.push_back(((unsigned char)s[i] & 0xFF) << 8);
    return ws;
}
// Deserialize little-endian UTF-16: even indices hold the low byte of each code unit, odd indices the high byte.
std::wstring deserialize_utf16le(std::string const &s) {
    assert(0==s.size()%2);
    std::wstring ws;
    for(size_t i=0;i<s.size();++i)
        if(i%2)
            ws.back() = ws.back() | (((unsigned char)s[i] & 0xFF) << 8);
        else
            ws.push_back((unsigned char)s[i] & 0xFF);
    return ws;
}
// Pick the byte order from the explicit argument or a leading BOM; with neither, default to big-endian per the UTF-16 encoding scheme.
std::wstring deserialize_utf16(std::string s, endian e=endian::unknown) {
    static_assert(std::numeric_limits<wchar_t>::max() >= 0xFFFF,"wchar_t must be large enough to hold UTF-16 code units");
    static_assert(CHAR_BIT>=8,"char must hold 8 bits of UTF-16 code units");
    assert(0==s.size()%2);
    if(endian::big == e)
        return deserialize_utf16be(s);
    if(endian::little == e)
        return deserialize_utf16le(s);
    if(2<=s.size() && ((unsigned char)s[0])==0xFF && ((unsigned char)s[1])==0xFE)
        return deserialize_utf16le(s.substr(2));  // strip the BOM
    if(2<=s.size() && ((unsigned char)s[0])==0xFE && ((unsigned char)s[1])==0xFF)
        return deserialize_utf16be(s.substr(2));  // strip the BOM
    return deserialize_utf16be(s);
}
int main() {
    // FF FE is a little-endian BOM; the payload is 'a', 'b', and the surrogate pair D83D DC7F.
    char c[] = "\xFF\xFE\x61\0b\0\x3d\xd8\x7f\xdc";
    std::string buf(std::begin(c),std::end(c)-1); // -1 drops the implicit null terminator
    std::wstring utf16 = deserialize_utf16(buf);
    std::cout << std::hex;
    std::copy(begin(utf16),end(utf16),std::ostream_iterator<int>(std::cout," "));
    std::cout << "\n";
}
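Assuming wchar_t can hold 16-bit code units, this should print 61 62 d83d dc7f: the code units for 'a', 'b', and the surrogate pair encoding U+1F47F.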