
Let's say we have a file main.cpp in windows-1251 encoding with the following content:

int main()
{
     wchar_t* ws = L"котэ"; //cat in Russian
     return 0;
}

Everything is fine if we compile this in Visual Studio, but suppose we have to compile it with GCC, whose default source encoding is UTF-8. Of course, we could convert the file's encoding, or pass the option "-finput-charset=windows-1251" to the compiler, but what if neither is possible? One way to work around it is to replace the raw text with the hex bytes of its UTF-32 encoding:

int main()
{
     wchar_t* ws = (wchar_t*)"\x3A\x04\x00\x00\x3E\x04\x00\x00\x42\x04\x00\x00\x4D\x04\x00\x00\x00\x00\x00\x00"; //cat in Russian
     return 0;
}

But it's kind of ugly: 4 letters become 20 bytes.

How else can it be done?

  • Use wchar_t* ws = L"котэ"; Commented Nov 1, 2016 at 12:31
  • @πάνταῥεῖ, incorrect. That will work only if the file main.cpp is UTF-8 encoded. Commented Nov 1, 2016 at 12:32
  • @AlekDepler U"котэ" then. Commented Nov 1, 2016 at 12:35
  • @πάνταῥεῖ, there is no C++11 available, and I'm pretty sure the problem would remain anyway. Commented Nov 1, 2016 at 12:36
  • @SamVarshavchik, "cat" in Ukrainian is "кiт"; go finish school, please. Commented Nov 1, 2016 at 12:37

1 Answer


What you need is to use a file encoding that is understood by both GCC and VS. It seems to me that saving the file in UTF-8 encoding is the way forward.

Also see: How can I make Visual Studio save all files as UTF-8 without signature on Project or Solution level?


4 Comments

I know that, and at first glance this looks like a simple and obvious problem. But here's the thing: if you define a single-byte non-English string (char* s = "котэ";), save it in UTF-8 and compile it in Visual Studio... guess what? You'll get the raw UTF-8 bytes in that string instead of bytes in the system locale, which leads to various problems throughout the code (for example, strlen will no longer report the length in characters correctly).
@AlekDepler you should stay consistent in your usage of char and wchar_t, or you will definitely run into trouble like that. If you must mix them, it's probably best to keep the char strings ASCII-only.
Yes, there are many different APIs that you need to consider. In VS/Windows the convention is to use wchar_t for anything Unicode (e.g. filenames), while on Linux it is more common to use char and interpret it as UTF-8. strlen and friends, as you say, count bytes, not Unicode characters - which is fine for most purposes (e.g. to determine how much memory to allocate or copy). If you want to write portable code, you need to be careful with how you use your Unicode strings.
One approach is to go all-in on UTF-8: encode your files as UTF-8, and use UTF-8 strings all over (use char* strings and/or std::string, and always interpret them as UTF-8). Then you can use a library such as UTF8-CPP to convert to/from UTF-16 and get the true string length (number of Unicode code points of a string), etc.
