I am having problems with subprocesses in Python which return unicode characters, especially the German ü, ä, ö characters.

My script opens a subprocess and reads its output with stdout.read(). Some of the returned strings may contain non-ASCII characters, but I don't know in advance whether or where they occur. So the output has to be decoded (or encoded?) somehow to display the string correctly. I can't work with a bytes object.

The following code shows in short what I am trying to do, but it fails to decode the string and raises "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 12: invalid start byte":

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
command = subprocess.Popen(command_array, stdout=subprocess.PIPE, shell=True)

command_output = command.stdout.read()
command_output = command_output.decode()
print(command_output)

I feel that there has to be some trivial solution to this, which I failed to find anywhere. Is there any way to correctly return those unicode characters in a string?

I am using Python 3.6.3, and the above script runs on Windows. A version which works under Linux as well will be equally appreciated!

  • Are you sure the encoding is utf-8 and not iso-8859-1 for example? Commented Nov 13, 2018 at 11:15
  • Note also, once you do find the correct encoding, you can pass it to Popen as the encoding argument and then it will be automatically decoded to str for you. Commented Nov 13, 2018 at 11:16
  • Passing in a list of tokens is fundamentally incompatible with shell=True, though it probably happens to work more or less by mistake on your platform. Commented Nov 13, 2018 at 11:31
  • Thanks for the replies! So it is not iso-8859-1, as it only returns empty spaces instead of the characters. Is there any way to find the correct encoding, or do I have to try manually? Commented Nov 13, 2018 at 11:35
  • If the encoding of your Python script is something other than UTF-8, you obviously aren't asking the shell to echo UTF-8 characters. If you have everything set up for UTF-8, you should be fine. Commented Nov 13, 2018 at 11:36

2 Answers

I have found by trial and error that decoding with cp850 works and yields the expected output:

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
command = subprocess.Popen(command_array, stdout=subprocess.PIPE, shell=True)

command_output = command.stdout.read()
command_output = command_output.decode('cp850')
print(command_output)

If you save the above code as a UTF-8 encoded file (the default for Python 3 regardless of the platform) and run it with python3, it prints:

string_with_ü_ä_ö

Unfortunately, I don't know where this particular encoding comes from or why it is chosen, so this might not work on different setups, but I am confident it will work on yours.
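
One plausible reading of the error message, assuming the Windows console here uses OEM code page 850 (an assumption that the trial-and-error result suggests but does not prove): in cp850 the byte 0x81 is ü, while in UTF-8 that byte can never start a character. The sketch below builds the bytes by hand instead of spawning a subprocess, so it behaves the same on any platform:

```python
# Bytes that a cp850 console would plausibly emit for the test string
# (assumption: the console code page is 850).
raw = "string_with_ü_ä_ö".encode("cp850")
print(raw)  # the three umlauts become single bytes 0x81, 0x84, 0x94

# Decoding the same bytes as UTF-8 fails at the first umlaut.
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
print(utf8_error)

# Decoding with the encoding that produced the bytes round-trips cleanly.
decoded = raw.decode("cp850")
assert decoded == "string_with_ü_ä_ö"
```

Position 12 is exactly where the ü sits in string_with_ü_ä_ö, which matches the traceback in the question.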

3 Comments

In the general case, there is no guarantee or expectation that this particular code page is the right one for your system. See (again) stackoverflow.com/questions/31469707/…
@tripleee All I know is that on my box the output comes encoded with that particular encoding (or a similar one) in this and other similar cases. The error message tells me that it is the same for the OP. As I said, I have no idea where this encoding comes from or what it depends on. I don't have my Windows box handy, but I am pretty sure this is not the encoding used by Python's open, which is some variant of Latin.
This is amazing. I was trying to read the git log from git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/… and it choked due to the character in the return from git.

With Python >= 3.6, you want subprocess.run() with universal_newlines=True

import subprocess

command_array = ['echo', 'string_with_ü_ä_ö']
result = subprocess.run(command_array,
    stdout=subprocess.PIPE, universal_newlines=True)
print(result.stdout)

Python 3.7 added text as an alias for universal_newlines; the new name better explains what the option actually does. universal_newlines is still accepted.
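
As a comment on the question points out, subprocess.run() and Popen also accept an encoding argument (available since Python 3.6), which avoids relying on the locale default. Here is a minimal sketch, assuming the child process can be told to emit UTF-8. The child is a Python interpreter rather than echo, and the PYTHONIOENCODING environment variable pins its output encoding. For some other tool you would need to know or configure its output encoding yourself:

```python
import os
import subprocess
import sys

# Ask the child interpreter to write UTF-8 to its stdout.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# The child code uses \u escapes so the command line itself stays ASCII.
child_code = 'print("H\\u00e4ll\\u00f6")'

result = subprocess.run(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    encoding="utf-8",  # decode the pipe as UTF-8 and return str, not bytes
    env=env,
)

text = result.stdout.strip()  # a str containing the two umlauts
```

No shell=True is needed, because the command is passed as a list and run directly.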

6 Comments

For (much) more on what all of this means, see also stackoverflow.com/a/51950538/874188
Maybe your Windows code page is something odd? stackoverflow.com/questions/31469707/…
I tried that (with the addition of shell=True) and it gives me the following error: "UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 12: character maps to <undefined>"
shell=True is decidedly and utterly wrong here, as already detailed in a previous comment. Can you figure out what character set your Windows is producing, and what encoding the Python script is in, as already asked above?
Oh, if it's just to get echo because you don't have an external tool with that name, replace the echo with something else. I imagine it's just a placeholder for something much more complex. For the sake of argument, try ['python', '-c', 'print("Hällö")'] instead of adding the wickedness that is the Windows command shell.