From the command-line arguments (sys.argv) I receive a string of Unicode escape sequences like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

For example, take this script, uni.py:

import sys
print(sys.argv[1])

command line:

python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021

output:

\u041f\u0440\u0438\u0432\u0435\u0442\u0021

I want to convert it to the Unicode string 'Привет!'.

  • Please clarify what you want to do. '\u041f\u0440\u0438\u0432\u0435\u0442\u0021' is the string 'Привет!'. Commented Mar 15, 2020 at 12:40
  • To clarify the above: that representation is Python's representation only, because some terminals cannot print Unicode. Do this simple experiment: print out the ordinal value of the first character. You will see it is 1055 (0x41f in hexadecimal), and not 92, the value for a backslash (nor 39, the single quote, because that is also not "part of the string", even though it gets printed by Python as well). Commented Mar 15, 2020 at 13:51

1 Answer

You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.

If you just print the string, you'll get the correct result, assuming your terminal supports the characters.

print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')

This will print:

Привет!
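The difference between a literal in source code and the same characters arriving as plain text can be checked directly. This is a minimal sketch: the variable names are illustrative, and the raw string stands in for sys.argv[1], assuming the shell passes the backslashes through unchanged, as in the question's output.

```python
# A string literal in source code: Python resolves the \u escapes itself.
from_literal = '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

# Stand-in for sys.argv[1] (an assumption): the raw string keeps every
# backslash as a real character, so nothing is resolved.
from_argv = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

print(len(from_literal), ord(from_literal[0]))  # 7 1055
print(len(from_argv), ord(from_argv[0]))        # 42 92
```

The first string has 7 characters and starts with 'П' (code point 1055). The second has 42 characters and starts with a backslash (code point 92).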

UPDATE

After you updated your question, it became clear to me that the string is not really a string literal (or Unicode literal) in source code, but input from the command line, where the backslash escapes arrive as ordinary characters. In that case you could use the "unicode-escape" codec to get the result you want. Note that encoding goes from Unicode to bytes, and decoding goes from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which maps each code point up to U+00FF to the byte with the same value. It raises UnicodeEncodeError if the input contains characters above U+00FF.

The following code will print the correct result for your example:

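import sys
# latin-1 turns the str into bytes; unicode-escape then interprets the \uXXXX sequences.
text = sys.argv[1].encode('latin-1').decode('unicode-escape')
print(text)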
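One caveat with latin-1 is that the encode step fails as soon as the argument already contains a character above U+00FF, for example a Cyrillic letter typed directly. A possible variant, sketched here with a hypothetical helper named decode_escapes, encodes with ASCII and the backslashreplace error handler instead, so such characters are turned into escapes first and restored by the decode step.

```python
def decode_escapes(raw: str) -> str:
    """Interpret backslash escapes such as \\u041f inside an ordinary str."""
    # backslashreplace rewrites every non-ASCII character as its own escape
    # sequence, so encoding never fails; the unicode-escape decode below
    # then turns all escapes (original and generated) into characters.
    as_bytes = raw.encode('ascii', 'backslashreplace')
    return as_bytes.decode('unicode-escape')


# Raw strings stand in for sys.argv[1] here.
print(decode_escapes(r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021') == 'Привет!')  # True
print(decode_escapes(r'Привет\u0021') == 'Привет!')  # True
```

Like the latin-1 version, this still interprets every backslash in the input, so a literal backslash has to be doubled by the caller.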

UPDATE 2

Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this method expects a proper Python literal, including the quotes. You could do something like this to solve it:

import ast, sys
text = ast.literal_eval("'" + sys.argv[1] + "'")

But note that this breaks if the input string contains a quote. I think it's a bit of a hack, since ast.literal_eval() is probably not intended for this purpose. The unicode-escape approach is simpler and more robust. However, the best solution depends on what you're building.
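The quote problem is easy to reproduce. The sketch below wraps the call in a hypothetical helper, parse_as_literal, which returns None instead of raising when the wrapped input is not a valid Python literal; the helper name and the None convention are assumptions made for illustration.

```python
import ast


def parse_as_literal(raw: str):
    """Return the parsed string, or None if raw does not form a valid literal."""
    try:
        # Wrap the input in single quotes so it looks like a str literal.
        return ast.literal_eval("'" + raw + "'")
    except (SyntaxError, ValueError):
        return None


print(parse_as_literal(r'\u0021') == '!')  # True
print(parse_as_literal("it's"))            # None
```

The second call fails because the apostrophe ends the literal early, leaving text that the parser rejects.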
