
So the issue is that when using C# the char is 4 bytes, so "ABC" is (65 0 66 0 67 0).

When sending that through a socket and reading it into a wstring in C++, I get the following output: A.

How can I convert such a string to a C++ string?

  • "I get the following output A": that's because you treated the bytes as a std::string, which is suitable only for single-byte code pages or, due to the lack of standardization in C++, UTF-8. This interpreted the first null byte as the end of the string. You should use std::u16string to read UTF-16 bytes. Commented Nov 23, 2020 at 16:21
  • BTW you didn't post either your C# or C++ code, but the bug suggests you're trying to read strings one at a time in C++. For that to work, you need to terminate strings on the C# side by emitting the appropriate NUL: either a single 0x00 byte for UTF-8, or two 0x00 bytes for UTF-16. Commented Nov 23, 2020 at 16:23
  • "the char is 4 bytes": no, two bytes. Commented Nov 23, 2020 at 16:34

3 Answers


Sounds like you need ASCII or UTF-8 encoding instead of Unicode (which in C# means UTF-16).

65 0 66 0 67 0 is only going to get you the A, since the next zero is interpreted as a null termination character in C++.

Strategies for converting Unicode to ASCII can be found here.


4 Comments

ASCII or UTF-8 or one of the single-byte encodings. The question does say wstring however.
Either UTF-8 encoding, or u16string on the C++ side. The 7-bit US-ASCII encoding will mangle any non-English text; UTF-8 emits the same bytes as US-ASCII for English text.
Hello, I need this to be lossless, and when you convert it to ASCII there is no way to recover the data.
Sounds like binary data to me, not a string.

using c# the char is 4 bytes

No, in C# strings are encoded in UTF-16. A UTF-16 code unit is two bytes, and a code point takes one or two code units. For simple characters a single code unit represents the code point (e.g. 65 0).

On Windows, wstring is usually UTF-16 encoded too (2 or 4 bytes per code point). But on Unix/Linux, wstring usually uses the UTF-32 encoding (always 4 bytes).

For ASCII characters the Unicode code point has the same numerical value, which is why UTF-16-encoded ASCII text often looks like this: {num} 0 {num} 0 {num} 0... See the details here: https://en.wikipedia.org/wiki/UTF-16

Could you show us some code for how you constructed your wstring object? The null byte is critical here, because it was the end marker for ASCII / ANSI strings.

7 Comments

Well, I am just passing it the raw data from the socket, like so: std::vector<char> data = socket.read(); then std::wcout << std::wstring(data.begin(), data.end()); (sorry for syntax errors, if any; this was written in this box). Also, I don't know how to print a u16string, but I have tried it in my file parser and it didn't change anything.
Well, then I think "socket" stops reading on a null byte.
If you take this code: godbolt.org/z/Yds7of, the string construction seems to work (which is a bit surprising to me).
No, socket is a custom function written by me. I have looked at the variable at a breakpoint and the data is there; it just ends the string at \0.
So the vector has 6 bytes in it... To handle UTF-16 you should do something like this: auto strView = std::u16string_view( reinterpret_cast<char16_t*>(&data[0]), data.size() / 2 ); (note the length is in char16_t units, i.e. half the byte count).

I have been able to solve the issue by using std::u16string. Here is some example code:

std::vector<char> data = { 65, 0, 66, 0, 67, 0 };
// u16string wants char16_t, so reinterpret the byte buffer and halve the length
std::u16string string(reinterpret_cast<const char16_t*>(data.data()), data.size() / 2);
// now string holds the UTF-16 text (u"ABC" here, assuming a little-endian host)

