
Let's say we have a file main.cpp in windows-1251 encoding with the following content:

int main()
{
     wchar_t* ws = L"котэ"; //cat in Russian
     return 0;
}

Everything is fine if we compile this in Visual Studio, but suppose we have to compile it with GCC, whose default source encoding is UTF-8. Of course, we could convert the file's encoding, or pass the option "-finput-charset=windows-1251" to the compiler, but what if neither is possible? One way to work around it is to replace the raw text with the hex bytes of its UTF-32 encoding:

int main()
{
     wchar_t* ws = (wchar_t*)"\x3A\x04\x00\x00\x3E\x04\x00\x00\x42\x04\x00\x00\x4D\x04\x00\x00\x00\x00\x00\x00"; //cat in Russian
     return 0;
}

But it's kind of ugly: 4 letters become 20 bytes.

How else can it be done?

  • Use wchar_t* ws = L"котэ"; Commented Nov 1, 2016 at 12:31
  • @πάνταῥεῖ, incorrect. That will work only if the file main.cpp is UTF-8 encoded. Commented Nov 1, 2016 at 12:32
  • @AlekDepler U"котэ" then. Commented Nov 1, 2016 at 12:35
  • @πάνταῥεῖ, there is no C++11 available, and I'm pretty sure the problem would remain anyway. Commented Nov 1, 2016 at 12:36
  • @SamVarshavchik, "cat" in Ukrainian is "кiт"; go finish school, please. Commented Nov 1, 2016 at 12:37

1 Answer


What you need is to use a file encoding that is understood by both GCC and VS. It seems to me that saving the file in UTF-8 encoding is the way forward.

Also see: How can I make Visual Studio save all files as UTF-8 without signature on Project or Solution level?


4 Comments

I know that, and at first glance this looks like a simple and obvious problem. But here's the thing: if you define a single-byte non-English string (char* s = "котэ";), save it in UTF-8 and compile it in Visual Studio... guess what? You'll get the raw UTF-8 bytes in that string instead of bytes in the system locale, which leads to various problems throughout the code (for example, strlen will no longer report the length in characters correctly).
@AlekDepler you should stay consistent in your usage of char and wchar_t, or you will definitely run into trouble like that. If you must mix them, it's probably best to keep the char strings ASCII-only.
Yes, there are many different APIs that you need to consider. In VS/Windows the convention is to use wchar_t for anything Unicode (e.g. filenames), while on Linux it is more common to use char and interpret it as UTF-8. strlen and friends, as you say, count bytes, not Unicode characters - which is fine for most purposes (e.g. to determine how much memory to allocate or copy). If you want to write portable code, you need to be careful with how you use your Unicode strings.
One approach is to go all-in on UTF-8: encode your files as UTF-8, and use UTF-8 strings all over (use char* strings and/or std::string, and always interpret them as UTF-8). Then you can use a library such as UTF8-CPP to convert to/from UTF-16 and get the true string length (number of Unicode code points of a string), etc.
