Python3 - Convert unicode literals string to unicode string

Question

From the command line parameters (sys.argv) I receive a string of unicode literals like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

For example this script uni.py:

import sys
print(sys.argv[1])

command line:

python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021

output:

\u041f\u0440\u0438\u0432\u0435\u0442\u0021

I want to convert it to the unicode string 'Привет!'

Please clarify what you want to do. '\u041f\u0440\u0438\u0432\u0435\u0442\u0021' is the string 'Привет!'.
– MisterMiyagi, Commented Mar 15, 2020 at 12:40
To clarify the above: that representation is only how Python displays the string, because some terminals cannot print Unicode. Do this simple experiment: print out the ordinal value of the first character. You will see it is 1055 (0x41f in hexadecimal), and not 92, the value for a backslash (nor 39, the single quote, because that is also not "part of the string", even though Python prints it as well).
– Jongware, Commented Mar 15, 2020 at 13:51
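
The experiment from the comment above can be sketched as follows. The variable names are hypothetical, and the raw string stands in for sys.argv[1] on the assumption that the backslashes reach the script intact, as the output in the question suggests.

```python
# A literal in source code: Python processes the escapes when it parses the file.
literal = '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

# Stand-in for sys.argv[1]: the backslashes are ordinary characters here.
from_argv = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

print(ord(literal[0]), len(literal))      # 1055 7
print(ord(from_argv[0]), len(from_argv))  # 92 42
```

The two strings look the same in source, but only the first one already contains the Cyrillic characters; the second contains seven six-character escape sequences.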

wovano · Accepted Answer · 2020-03-15 15:26:10Z

You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.

If you just print the string, you'll get the correct result, assuming your terminal supports the characters.

print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')

This will print:

Привет!
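
As a small illustration of the str/bytes distinction mentioned above, here is a sketch that round-trips the same string through bytes. UTF-8 is picked only as an example encoding; the answer does not prescribe one.

```python
text = '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

# Encoding turns the Unicode string into bytes (UTF-8 chosen for illustration).
data = text.encode('utf-8')

print(type(text).__name__, len(text))  # str 7
print(type(data).__name__, len(data))  # bytes 13

# Decoding with the same encoding restores the original string.
restored = data.decode('utf-8')
print(restored == text)                # True
```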

UPDATE

After you updated your question, it became clear to me that the string in question is not a string literal (or unicode literal) in source code, but input from the command line, so the backslash escapes arrive as plain characters. In that case you could use the "unicode-escape" codec to get the result you want. Note that encoding works from Unicode to bytes, and decoding works from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which maps each code point up to U+00FF to the byte with the same value. (Input containing characters above U+00FF would make that step raise UnicodeEncodeError.)

The following code will print the correct result for your example:

import sys
# str -> bytes via latin-1, then bytes -> str while interpreting the \uXXXX escapes
text = sys.argv[1].encode('latin-1').decode('unicode-escape')
print(text)
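
If the input may mix escape sequences with characters that are already outside latin-1, the plain encode('latin-1') step fails. One possible workaround, not part of the answer above, is the "backslashreplace" error handler. The helper name decode_escapes is hypothetical, and the raw string stands in for sys.argv[1].

```python
def decode_escapes(raw: str) -> str:
    # backslashreplace rewrites any character that latin-1 cannot represent
    # as an escape sequence, which unicode-escape then decodes back again.
    return raw.encode('latin-1', 'backslashreplace').decode('unicode-escape')

raw = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'  # stand-in for sys.argv[1]
print(decode_escapes(raw))            # Привет!

mixed = 'Привет' + r'\u0021'          # real Cyrillic followed by one escape
try:
    mixed.encode('latin-1')
except UnicodeEncodeError:
    print('plain latin-1 encoding fails on this input')
print(decode_escapes(mixed))          # Привет!
```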

UPDATE 2

Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this method expects a proper Python literal, including the quotes. You could do something like this to solve that:

import ast
text = ast.literal_eval("'" + sys.argv[1] + "'")

But note that this breaks if your input string contains a quote. I think it's a bit of a hack, since the method is probably not intended for this purpose. The unicode-escape approach is simpler and more robust. However, which solution is best depends on what you're building.
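
To make the quote problem concrete, here is a minimal sketch comparing the two approaches on an input that contains a single quote. The function names and the sample inputs are hypothetical; the raw strings stand in for sys.argv[1].

```python
import ast

def via_literal_eval(raw: str) -> str:
    # Wrap the raw text in single quotes so it parses as a Python string literal.
    return ast.literal_eval("'" + raw + "'")

def via_unicode_escape(raw: str) -> str:
    # The approach from the first update: latin-1 bytes, then unicode-escape.
    return raw.encode('latin-1').decode('unicode-escape')

ok = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'
print(via_literal_eval(ok))        # Привет!

bad = r"it's \u0021"               # the quote ends the wrapped literal early
try:
    via_literal_eval(bad)
except SyntaxError:
    print('literal_eval could not parse the input')
print(via_unicode_escape(bad))     # it's !
```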
