decode python binary string but not ensure ascii symbols

Question

I have a binary object:

b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}'

and I want to print it as readable Unicode text rather than as ASCII-only escape sequences.

There is a hacky way to do it:

import json

# `string` is the bytes object shown above
decoded = string.decode()
parsed_to_dict = json.loads(decoded)
dumped = json.dumps(parsed_to_dict, ensure_ascii=False)
print(dumped)

>>> {"node": "Обновление"}

However, the text will not always be parseable as JSON, so I need a simpler way.

Is there a way to print my binary object (or a decoded Unicode string) as a non-ASCII string without going through parsing and dumping JSON?

For example, how can I print b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435' as Обновление?

@deceze It's not unclear what he's asking IMO. They want to remove the escape backslashes to get that result. They're saying they've found a way in the case that it's a JSON string, but they want a method in the general case.
– FHTMitchell, Commented May 18, 2018 at 11:22
@FHT Sure, but this example looks like JSON. Both JSON parsing and AST-literal parsing work on that, yes. But if the concern is that in some cases it may not be valid JSON… well then, what will it be? Valid Python which works with AST? Or something entirely different?
– deceze ♦, Commented May 18, 2018 at 11:23
I guess you could do data.decode('unicode-escape'). But I'd be wary of recommending that without knowing what variations are possible in the input data.
– PM 2Ring, Commented May 18, 2018 at 11:27

PM 2Ring · Accepted Answer · 2018-05-20 00:34:56Z

3

A bytes string like

b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'

has been encoded using Unicode escape sequences. To convert it back into a proper Unicode string you simply need to specify the 'unicode-escape' codec:

data = b'\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.decode('unicode-escape')
print(out)

output

Обновление
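
The same call also handles JSON-looking bytes like those in the question, with no parsing step. The following is a minimal sketch, assuming the input is pure ASCII with backslash escapes; the variable names are only illustrative. Keep in mind that the codec interprets every Python-style escape (for example \n), and that it reads any byte above 0x7f as Latin-1, so raw UTF-8 bytes come out garbled:

```python
# JSON-looking bytes (ASCII only, with \uXXXX escapes).
raw = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}'

text = raw.decode('unicode-escape')
print(text)  # {"node": "Обновление"}

# Caveat: bytes above 0x7f are read as Latin-1, not UTF-8.
utf8_bytes = 'é'.encode('utf-8')               # b'\xc3\xa9'
mangled = utf8_bytes.decode('unicode-escape')
print(mangled)  # Ã©
```

If the input may contain raw UTF-8 alongside escapes, this codec on its own is probably not the right tool.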

However, if data is already a Unicode string, then you first need to encode it to bytes. You can do that using the ascii codec, presuming data only contains ASCII characters. If it contains characters outside ASCII but within the range of \x80 to \xff you may be able to use the 'latin1' codec.

data = '\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435'
out = data.encode('ascii').decode('unicode-escape')
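
Both cases can be folded into one helper. This is only a sketch: unescape is a hypothetical name, and the 'latin-1' plus 'backslashreplace' step is a substitution for the plain 'ascii' encode shown above, chosen so that text which already contains non-ASCII characters is re-escaped instead of raising UnicodeEncodeError.

```python
def unescape(data):
    """Decode \\uXXXX-style escapes in bytes or str (hypothetical helper)."""
    if isinstance(data, str):
        # Characters up to U+00FF become single Latin-1 bytes; anything
        # beyond that is rewritten as a backslash escape, so the encode
        # step cannot fail and the decode step restores the characters.
        data = data.encode('latin-1', errors='backslashreplace')
    return data.decode('unicode-escape')


print(unescape(b'\\u041e\\u0431'))      # Об
print(unescape('\\u041e\\u0431'))       # Об
print(unescape('Об \\u043d\\u043e'))    # Об но
```

Any literal backslash in the input is still treated as the start of an escape, so this is suitable only when the data is known to use that convention.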

answered May 20, 2018 at 0:34

PM 2Ring


FHTMitchell · Accepted Answer · 2018-05-18 11:26:48Z

0

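This should work as long as every escape sequence in the data is valid (no stray single \) and the text contains no unescaped single quote, which would end the literal early.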

import ast
bytes_object = b'{"node": "\\u041e\\u0431\\u043d\\u043e\\u0432\\u043b\\u0435\\u043d\\u0438\\u0435"}}'

unicode_string = ast.literal_eval("'{}'".format(bytes_object.decode()))

output:

'{"node": "Обновление"}}'
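
The validity requirement matters in practice, because the decoded text is pasted into a quoted Python literal before being evaluated. A hedged sketch of the failure modes follows; literal_unescape is a hypothetical wrapper, not part of ast. A single quote in the data or an incomplete escape makes ast.literal_eval raise SyntaxError, which the wrapper turns into None.

```python
import ast


def literal_unescape(raw):
    """Hypothetical wrapper around the ast.literal_eval approach.

    Returns the unescaped text, or None when the data cannot be
    embedded in a single-quoted Python literal.
    """
    try:
        return ast.literal_eval("'{}'".format(raw.decode()))
    except (SyntaxError, ValueError):
        return None


print(literal_unescape(b'\\u041e\\u0431'))   # Об
print(literal_unescape(b"it's"))             # None (quote ends the literal early)
print(literal_unescape(b'\\u04'))            # None (truncated \u escape)
```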

edited May 18, 2018 at 11:26

answered May 18, 2018 at 11:20

FHTMitchell
