Python UnicodeDecodeError - How to correctly read unicode strings from subprocess?

Question

I am having problems with subprocesses in Python which return unicode characters, especially the German ü, ä, ö characters.

My script opens a subprocess and reads the strings it returns with stdout.read(). Some of those strings may contain unicode characters, but it is not known in advance whether or where they occur. So the output has to be decoded (or encoded?) somehow to display the string correctly. I cannot work with a bytes object.

The following code shows in short what I am trying to do, but it fails to decode the string and raises the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 12: invalid start byte":

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
command = subprocess.Popen(command_array, stdout=subprocess.PIPE, shell=True)

command_output = command.stdout.read()
command_output = command_output.decode()
print(command_output)

I feel that there has to be some trivial solution to this, which I failed to find anywhere. Is there any way to correctly return those unicode characters in a string?

I am using Python 3.6.3, and the above script runs on Windows. A version which works under Linux as well will be equally appreciated!

Are you sure the encoding is utf-8 and not iso-8859-1, for example?
– Daniel Roseman, Commented Nov 13, 2018 at 11:15
Note also, once you do find the correct encoding, you can pass it to Popen as the encoding argument and then it will be automatically decoded to str for you.
– Daniel Roseman, Commented Nov 13, 2018 at 11:16
Passing in a list of tokens is fundamentally incompatible with shell=True, though it probably happens to work more or less by mistake on your platform.
– tripleee, Commented Nov 13, 2018 at 11:31
Thanks for the replies! So it is not iso-8859-1, as it only returns empty spaces instead of the characters. Is there any way to find the correct encoding, or do I have to try manually?
– Johannes M., Commented Nov 13, 2018 at 11:35
If the encoding of your Python script is something other than UTF-8, you obviously aren't asking the shell to echo UTF-8 characters. If you have everything set up for UTF-8, you should be fine.
– tripleee, Commented Nov 13, 2018 at 11:36
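
On the question of trying encodings manually: the trial can at least be automated. The following is only a rough sketch. The helper name decode_first_match and the candidate list are made up for illustration, and the sample bytes in the usage note are an assumption (only byte 0x81 at position 12 is confirmed by the error message).

```python
def decode_first_match(raw, candidates=("utf-8", "cp850", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly.

    The candidate list is illustrative only. Strict encodings such as UTF-8
    should come first, because single-byte code pages accept almost any input.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate rejects the bytes; move on to the next one.
            continue
    raise ValueError("no candidate encoding could decode the data")
```

For example, decode_first_match(b"string_with_\x81_\x84_\x94") skips utf-8 and settles on cp850. Keep in mind that a decode without errors does not prove the encoding is the right one, so the result still needs a visual check.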

Stop harming Monica · Accepted Answer · 2018-11-13 12:24:33Z

1

I have found by trial and error that decoding with cp850 works and yields the expected output:

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
command = subprocess.Popen(command_array, stdout=subprocess.PIPE, shell=True)

command_output = command.stdout.read()
command_output = command_output.decode('cp850')
print(command_output)

If you save the above code as a utf8 encoded file (the default for python3 regardless of the platform) and run it with python3, it prints:

string_with_ü_ä_ö

Unfortunately, I don't know where or why this particular encoding is chosen, so this might not work with different setups, but at least I am confident it will work with yours.
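
One plausible explanation, not confirmed anywhere in this thread: in cp850 the byte 0x81 is ü, and cp850 is the OEM code page that Western European Windows consoles commonly use. The sketch below assumes that console programs started through the shell write to a pipe in the console's output code page. It asks Windows for that code page through the Win32 functions GetConsoleOutputCP and GetOEMCP, and falls back to the locale's preferred encoding on other platforms. The function name console_encoding is invented for this example.

```python
import locale
import sys


def console_encoding():
    """Best-effort guess at the encoding console programs use for piped output."""
    if sys.platform == "win32":
        import ctypes

        kernel32 = ctypes.windll.kernel32
        # GetConsoleOutputCP() returns 0 when no console is attached,
        # in which case the system-wide OEM code page is used instead.
        code_page = kernel32.GetConsoleOutputCP() or kernel32.GetOEMCP()
        return "cp{}".format(code_page)
    # Assumption: elsewhere, child processes follow the locale encoding.
    return locale.getpreferredencoding(False)
```

With that, command_output.decode(console_encoding()) would replace the hard-coded 'cp850'. This is still a guess: a child process is free to write in any encoding it likes.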


3 Comments

tripleee Over a year ago

In the general case, there is no guarantee or expectation that this particular code page is the right one for your system. See (again) stackoverflow.com/questions/31469707/…

Stop harming Monica Over a year ago

@tripleee All I know is that on my box the output comes encoded with that particular encoding (or a similar one) in this and other similar cases. The error message tells me that it is the same for the OP. As I said, I have no idea where this encoding comes from or what it depends on. I don't have my Windows box handy, but I am pretty sure this is not the encoding used by Python's open, which is some variant of latin.

Piotr Over a year ago

this is amazing. i was trying to read the git log from git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/… and it choked due to the character in the return from git.

tripleee · Accepted Answer · 2018-11-13 11:34:03Z

1

With Python >= 3.6, you want subprocess.run() with universal_newlines=True:

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
result = subprocess.run(command_array,
    stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)

Python 3.7 added text as an alias for universal_newlines; the new name better explains what the option actually does. The old name still works.
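
Note that universal_newlines=True decodes with the locale's preferred encoding. If that is cp1252, where byte 0x81 is undefined, it would explain a 'charmap' error on the same byte. When the child's encoding is known, it can be passed explicitly through the encoding argument (available since Python 3.6). The following is a minimal sketch under stated assumptions: the child is another Python interpreter rather than echo, it is forced to write UTF-8 through the PYTHONIOENCODING environment variable, and the helper name run_python_child is invented for this example.

```python
import os
import subprocess
import sys


def run_python_child(text):
    """Have a child Python print text and return what the parent decodes."""
    # ascii() renders non-ASCII characters as escapes, so the command line
    # stays pure ASCII and only the stdout encoding is being exercised.
    child_code = "print({})".format(ascii(text))
    # PYTHONIOENCODING makes the child write UTF-8 regardless of the console.
    env = dict(os.environ, PYTHONIOENCODING="utf-8")
    result = subprocess.run(
        [sys.executable, "-c", child_code],  # a list of tokens, no shell=True
        stdout=subprocess.PIPE,
        encoding="utf-8",  # decode the pipe as UTF-8, not the locale default
        env=env,
        check=True,
    )
    # Drop the trailing newline that print() adds.
    return result.stdout.rstrip("\n")
```

Calling run_python_child("Hällö") should hand back the same string on both Windows and Linux, because both ends of the pipe agree on UTF-8. For a real external tool, the encoding argument has to match whatever that tool actually emits.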


6 Comments

tripleee Over a year ago

For (much) more on what all of this means, see also stackoverflow.com/a/51950538/874188

tripleee Over a year ago

Maybe your Windows code page is something odd? stackoverflow.com/questions/31469707/…

Johannes M. Over a year ago

I tried that (with the addition of shell=True) and it gives me the following error: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 12: character maps to <undefined>"

tripleee Over a year ago

shell=True is decidedly and utterly wrong here, as already detailed in a previous comment. Can you figure out what character set your Windows is producing, and what encoding the Python script is in, as already asked above?

tripleee Over a year ago

Oh, if it's just to get echo because you don't have an external tool with that name, replace the echo with something else. I imagine it's just a placeholder for something much more complex. For the sake of argument, try ['python', '-c', 'print("Hällö")'] instead of adding the wickedness that is the Windows command shell.
