
So the issue is that when using C# the char is 4 bytes, so "ABC" is (65 0 66 0 67 0).

When sending that through a socket and reading it into a wstring in C++, I get the following output: A.

How can I convert such a string to a C++ string?

  • "I get the following output A": that's because you treated the bytes as a std::string, which is suitable only for single-byte code pages or, due to the lack of standardization in C++, UTF-8. This interpreted the first null byte as the end of the string. You should use std::u16string to read UTF-16 bytes. Commented Nov 23, 2020 at 16:21
  • BTW you didn't post either your C# or C++ code, but the bug suggests you're trying to read strings one at a time in C++. For that to work, you need to terminate strings on the C# side by emitting the appropriate NUL: either a single 0x00 byte for UTF-8, or two 0x00 bytes for UTF-16. Commented Nov 23, 2020 at 16:23
  • "the char is 4 bytes": no, two bytes. Commented Nov 23, 2020 at 16:34

3 Answers


Sounds like you need ASCII or UTF-8 encoding instead of Unicode (which in C# means UTF-16).

65 0 66 0 67 0 is only going to get you the A, since the next zero is interpreted as a null termination character in C++.

Strategies for converting Unicode to ASCII can be found here.


4 Comments

ASCII or UTF-8 or one of the single-byte encodings. The question does say wstring however.
Either UTF-8 encoding, or u16string on the C++ side. The 7-bit US-ASCII encoding will mangle any non-English text; UTF-8 emits the same bytes as US-ASCII for English text.
Hello, I need this to be lossless, and when you convert it to ASCII there is no way to recover the data.
Sounds like binary data to me, not a string.

using c# the char is 4 bytes

No, in C# strings are encoded in UTF-16. A UTF-16 code unit is two bytes, and a code point takes one or two code units. For simple characters a single code unit represents the code point (e.g. 65 0).

On Windows, wstring is usually UTF-16 encoded too (2 or 4 bytes per code point). But on Unix/Linux, wstring usually uses the UTF-32 encoding (always 4 bytes).

For ASCII characters the Unicode code point has the same numerical value, which is why UTF-16-encoded ASCII text often looks like this: {num} 0 {num} 0 {num} 0... See the details here: https://en.wikipedia.org/wiki/UTF-16

Could you show us some code for how you constructed your wstring object? The null byte is critical here, because it was the end marker for ASCII / ANSI strings.

7 Comments

Well, I am just passing it the raw data from the socket, like so: std::vector<char> data = socket.read(); then std::wcout << std::wstring(data.begin(), data.end()); (sorry for syntax errors, if any; this was written in this box). Also, I don't know how to print a u16string, but I have tried it in my file parser and it didn't change anything.
Well, then I think "socket" stops reading on a null byte.
If you take this code: godbolt.org/z/Yds7of, the string construction seems to work (which is a bit surprising to me).
No, socket is a custom function written by me. I have looked at the variable at a breakpoint and the data is there; it just ends the string at \0.
So the vector has 6 bytes in it... To handle UTF-16 you should do something like this: auto strView = std::u16string_view( reinterpret_cast<char16_t*>(&data[0]), data.size() / 2 ); (note the length is in char16_t units, i.e. half the byte count).

I have been able to solve the issue by using std::u16string. Here is some example code:

std::vector<char> data = { 65, 0, 66, 0, 67, 0 };
// u16string wants char16_t, so reinterpret the byte buffer and halve the length
std::u16string string(reinterpret_cast<const char16_t*>(data.data()), data.size() / 2);
// now string holds the UTF-16 text (u"ABC" here, assuming a little-endian host)

