So you've stuck a series of bytes representing a UTF-16 encoded string into a std::string. Presumably you're doing something like deserializing bytes that represent UTF-16, and the API for retrieving the bytes to be deserialized specifies std::string. I don't think that's the best design, but you'll handle converting it to a wstring the same way you'd handle converting the bytes to a float or anything else; validate the byte buffer and then cast it:
#include <cassert>
#include <iterator>
#include <string>
char c[] = "\0a\0b\xd8\x3d\xdc\x7f"; // big-endian UTF-16: 'a', 'b', and the surrogate pair D83D DC7F
std::string buf(std::begin(c),std::end(c)-1); // -1 drops the implicit null terminator
assert(0==buf.size()%2);
std::wstring utf16(reinterpret_cast<wchar_t const *>(buf.data()),buf.size()/sizeof(wchar_t));
// also validate that each code unit is legal, and that there are no isolated surrogates
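For that last step, here's a minimal validation sketch (is_valid_utf16 is just an illustrative name, not a standard function): every code unit in the high-surrogate range D800-DBFF must be immediately followed by one in the low-surrogate range DC00-DFFF, and a low surrogate must never appear on its own:
#include <string>
bool is_valid_utf16(std::wstring const &ws) {
    for(size_t i=0;i<ws.size();++i) {
        unsigned cu = ws[i] & 0xFFFF;
        if(cu >= 0xD800 && cu <= 0xDBFF) {         // high surrogate
            if(i+1 >= ws.size())
                return false;                      // string ends mid-pair
            unsigned next = ws[++i] & 0xFFFF;
            if(next < 0xDC00 || next > 0xDFFF)
                return false;                      // not followed by a low surrogate
        } else if(cu >= 0xDC00 && cu <= 0xDFFF) {
            return false;                          // isolated low surrogate
        }
    }
    return true;
}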
Things to keep in mind:
- This cast assumes that wchar_t is 16 bits, whereas most platforms use a 32-bit wchar_t.
- To be useful your APIs will need to be able to treat wchar_t strings as UTF-16, either because that's the platform's specified encoding for wchar_t* or because the APIs simply follow that convention.
- This cast assumes that the data matches the machine's endianness; otherwise you'll have to byte-swap each UTF-16 code unit in the wstring (see the sketch after this list). Under the UTF-16 encoding scheme, if the initial bytes aren't 0xFF 0xFE or 0xFE 0xFF and no higher-level protocol says otherwise, the data is big-endian.
- std::begin(), std::end() and string::data() are C++11
- UTF-16 doesn't actually meet the C++ language's requirements for a wchar_t encoding, but some platforms use it regardless. This causes an issue with some standard APIs that are supposed to deal in codepoints but can't, simply because a wchar_t holding a single UTF-16 code unit cannot represent all of the platform's codepoints.
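Regarding the byte-order caveat above: if the serialized bytes turn out not to match the machine's byte order, a per-code-unit swap after the cast is enough. Here's a minimal sketch, assuming the same 16-bit wchar_t as the cast (swap_utf16_byte_order is just an illustrative name, not a standard function):
#include <string>
// Swap the two bytes of every UTF-16 code unit in place.
void swap_utf16_byte_order(std::wstring &ws) {
    for(wchar_t &cu : ws) {            // range-based for is C++11
        unsigned u = cu & 0xFFFF;
        cu = static_cast<wchar_t>(((u & 0xFF) << 8) | (u >> 8));
    }
}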
Here's an implementation that doesn't rely on platform-specific details and requires nothing more than that wchar_t be large enough to hold UTF-16 code units and that each char hold exactly 8 bits of a UTF-16 code unit. It doesn't actually validate the UTF-16 data, though.
#include <string>
#include <cassert>
#include <climits>
#include <limits>
#include <iterator>
#include <algorithm>
#include <iostream>
enum class endian {
    big, little, unknown
};
// Deserialize big-endian UTF-16: even indices hold the high byte of each code unit, odd indices the low byte.
std::wstring deserialize_utf16be(std::string const &s) {
    assert(0==s.size()%2);
    std::wstring ws;
    for(size_t i=0;i<s.size();++i)
        if(i%2)
            ws.back() = ws.back() | ((unsigned char)s[i] & 0xFF);
        else
            ws.push_back(((unsigned char)s[i] & 0xFF) << 8);
    return ws;
}
// Deserialize little-endian UTF-16: even indices hold the low byte of each code unit, odd indices the high byte.
std::wstring deserialize_utf16le(std::string const &s) {
    assert(0==s.size()%2);
    std::wstring ws;
    for(size_t i=0;i<s.size();++i)
        if(i%2)
            ws.back() = ws.back() | (((unsigned char)s[i] & 0xFF) << 8);
        else
            ws.push_back((unsigned char)s[i] & 0xFF);
    return ws;
}
// Pick the byte order from the explicit argument or a leading BOM; with neither, default to big-endian per the UTF-16 encoding scheme.
std::wstring deserialize_utf16(std::string s, endian e=endian::unknown) {
    static_assert(std::numeric_limits<wchar_t>::max() >= 0xFFFF,"wchar_t must be large enough to hold UTF-16 code units");
    static_assert(CHAR_BIT>=8,"char must hold 8 bits of UTF-16 code units");
    assert(0==s.size()%2);
    if(endian::big == e)
        return deserialize_utf16be(s);
    if(endian::little == e)
        return deserialize_utf16le(s);
    if(2<=s.size() && ((unsigned char)s[0])==0xFF && ((unsigned char)s[1])==0xFE)
        return deserialize_utf16le(s.substr(2));  // strip the BOM
    if(2<=s.size() && ((unsigned char)s[0])==0xFE && ((unsigned char)s[1])==0xFF)
        return deserialize_utf16be(s.substr(2));  // strip the BOM
    return deserialize_utf16be(s);
}
int main() {
    // FF FE is a little-endian BOM; the payload is 'a', 'b', and the surrogate pair D83D DC7F.
    char c[] = "\xFF\xFE\x61\0b\0\x3d\xd8\x7f\xdc";
    std::string buf(std::begin(c),std::end(c)-1); // -1 drops the implicit null terminator
    std::wstring utf16 = deserialize_utf16(buf);
    std::cout << std::hex;
    std::copy(begin(utf16),end(utf16),std::ostream_iterator<int>(std::cout," "));
    std::cout << "\n";
}
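Assuming wchar_t can hold 16-bit code units, this should print 61 62 d83d dc7f: the code units for 'a', 'b', and the surrogate pair encoding U+1F47F.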