I have a list of bytes (8-bit values; in C/C++ terms they make up a wchar_t string) that together form a Unicode string, byte by byte. How can I convert those values into a Python string? I tried a few things, but none of them could join each pair of bytes into one character and build the entire string from them. Thank you.
3 Answers
Converting a sequence of bytes to a Unicode string is done by calling the decode() method on a str (Python 2.x) or bytes (Python 3.x) object.
If you actually have a list of bytes, you can build that object first with ''.join(bytelist) (Python 2.x) or b''.join(bytelist) (Python 3.x).
You need to specify the encoding that was used to encode the original Unicode string.
However, the term "Python string" is a bit ambiguous and also version-dependent. The Python str type stands for a byte string in Python 2.x and a Unicode string in Python 3.x. So, in Python 2, just doing ''.join(bytelist) will give you a str object.
Demo for Python 2:
In [1]: 'тест'
Out[1]: '\xd1\x82\xd0\xb5\xd1\x81\xd1\x82'
In [2]: bytelist = ['\xd1', '\x82', '\xd0', '\xb5', '\xd1', '\x81', '\xd1', '\x82']
In [3]: ''.join(bytelist).decode('utf-8')
Out[3]: u'\u0442\u0435\u0441\u0442'
In [4]: print ''.join(bytelist).decode('utf-8') # encodes to the terminal encoding
тест
In [5]: ''.join(bytelist) == 'тест'
Out[5]: True
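For Python 3, the same demo might look like the sketch below. It assumes the list holds either integer byte values or single-byte bytes objects (the names byte_values and byte_chunks are made up for illustration), and that the data is UTF-8 as in the demo above.

```python
# Python 3 sketch: the same UTF-8 data as in the Python 2 demo.

# Case 1 (assumed): the list holds integers in the range 0..255.
byte_values = [0xD1, 0x82, 0xD0, 0xB5, 0xD1, 0x81, 0xD1, 0x82]
raw = bytes(byte_values)        # build one bytes object from the integers
text = raw.decode("utf-8")      # decode the whole sequence at once
print(text)

# Case 2 (assumed): the list holds single-byte bytes objects.
byte_chunks = [b"\xd1", b"\x82", b"\xd0", b"\xb5",
               b"\xd1", b"\x81", b"\xd1", b"\x82"]
raw2 = b"".join(byte_chunks)    # join first, then decode
text2 = raw2.decode("utf-8")
print(text2)
```

In both cases the bytes are joined into one object before decoding, so multi-byte characters stay intact.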
3 Comments
Out[3] from the answer would show a regular (Unicode) string. Output 4 would print the string (almost the same thing).
You can also convert the byte list into a string list using decode():
stringlist = [x.decode('utf-8') for x in bytelist]
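A small Python 3 sketch of how element-by-element decoding behaves. The sample lists are made up for illustration. It works when every element is a complete encoded character on its own, which for UTF-8 means ASCII, and it fails when a character is split across elements.

```python
# Per-element decoding works if each element is a complete character.
ascii_list = [b"t", b"e", b"s", b"t"]
string_list = [x.decode("utf-8") for x in ascii_list]
print(string_list)

# A character split across two list elements cannot be decoded piecewise.
multibyte_list = [b"\xd1", b"\x82"]   # the two UTF-8 bytes of one Cyrillic letter
try:
    [x.decode("utf-8") for x in multibyte_list]
    per_byte_ok = True
except UnicodeDecodeError:
    per_byte_ok = False
print(per_byte_ok)

# Joining first and decoding once handles the split character.
joined = b"".join(multibyte_list).decode("utf-8")
print(joined)
```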
1 Comment
b'\x7f' as UTF-8, which is what the code in this answer would do. And given that the OP has stated they have 8-bit bytes from a C++ wchar_t data type it is almost guaranteed to not be ASCII or UTF-8.
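If the bytes really come from a C/C++ wchar_t buffer, a sketch along these lines may be closer to what is needed. It assumes little-endian data and the common conventions that wchar_t is 2 bytes (UTF-16) on Windows and 4 bytes (UTF-32) on most Linux and macOS toolchains. The sample byte lists are made up for illustration, and the actual width and byte order should be checked against the program that produced the data.

```python
# Assumed: the list holds integer byte values copied from a wchar_t buffer.

# 2-byte wchar_t (typical on Windows): decode as UTF-16, little-endian.
utf16_bytes = [0x42, 0x04, 0x35, 0x04, 0x41, 0x04, 0x42, 0x04]
text16 = bytes(utf16_bytes).decode("utf-16-le")
print(text16)

# 4-byte wchar_t (typical on Linux/macOS): decode as UTF-32, little-endian.
utf32_bytes = [0x42, 0x04, 0x00, 0x00,
               0x35, 0x04, 0x00, 0x00,
               0x41, 0x04, 0x00, 0x00,
               0x42, 0x04, 0x00, 0x00]
text32 = bytes(utf32_bytes).decode("utf-32-le")
print(text32)
```

Here each pair (or group of four) bytes becomes one character, which is the "join 2 bytes into 1 character" behavior the question asks about.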