9

I'm trying to find a generic solution to print unicode strings from a python script.

The requirements are that it must run in both python 2.7 and 3.x, on any platform, and with any terminal settings and environment variables (e.g. LANG=C or LANG=en_US.UTF-8).

The python print function automatically tries to encode to the terminal encoding when printing, but if the terminal encoding is ascii it fails.

For example, the following works when the environment variable LANG=en_US.UTF-8:

x = u'\xea'
print(x)

But it fails in python 2.7 when "LANG=C":

UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 0: ordinal not in range(128)

The following works regardless of the LANG setting, but would not properly show unicode characters if the terminal was using a different unicode encoding:

print(x.encode('utf-8'))

The desired behavior would be to always show unicode in the terminal if it is possible and show some encoding if the terminal does not support unicode. For example, the output would be UTF-8 encoded if the terminal only supported ascii. Basically, the goal is to do the same thing as the python print function when it works, but in the cases where the print function fails, use some default encoding.
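To make the desired behavior concrete, here is a rough sketch (the helper name `print_unicode` is hypothetical, not from any library): try the terminal's encoding first, and fall back to UTF-8 bytes when that fails or when no encoding is reported.

```python
import sys

def print_unicode(s):
    """Print s using the terminal's encoding when possible,
    falling back to UTF-8 bytes otherwise (hypothetical helper)."""
    enc = sys.stdout.encoding or 'utf-8'
    try:
        data = s.encode(enc)
    except UnicodeEncodeError:
        data = s.encode('utf-8')
    # getattr covers both Python 2 (no .buffer) and Python 3
    getattr(sys.stdout, 'buffer', sys.stdout).write(data + b'\n')
```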

4 Comments
  • Python 3.0 provides an alternative string type for binary data and supports Unicode text in its normal string type (ASCII is treated as a simple subset of Unicode). Python 2.6 provides an alternative string type for non-ASCII Unicode text and supports both simple text and binary data in its normal string type. So now, what's your question? Commented Dec 7, 2014 at 21:06
  • Any terminal settings and environment variables? Including incorrect ones? :^) Commented Dec 7, 2014 at 22:35
  • Yes, because I get bug reports even when the user's environment is the main problem, so I'd like to try to make the code as robust as possible. Commented Dec 7, 2014 at 22:55
  • I believe the OP's requirements are that incorrect terminal settings (such as the C locale) should present the user with a reasonable default, such as UTF-8, not with a UnicodeEncodeError and a traceback. The former can produce garbage at worst (but will do what the user wants on all modern systems), whereas the latter is bound to frustrate the user. Commented Dec 8, 2014 at 20:52

4 Answers

13

You can handle the LANG=C case by telling sys.stdout to default to UTF-8 in cases when it would otherwise default to ASCII.

import sys, codecs

if sys.stdout.encoding is None or sys.stdout.encoding == 'ANSI_X3.4-1968':
    utf8_writer = codecs.getwriter('UTF-8')
    if sys.version_info.major < 3:
        sys.stdout = utf8_writer(sys.stdout, errors='replace')
    else:
        sys.stdout = utf8_writer(sys.stdout.buffer, errors='replace')

print(u'\N{snowman}')

The above snippet fulfills your requirements: it works in Python 2.7 and 3.4, and it doesn't break when LANG is in a non-UTF-8 setting such as C.

It is not a new technique, but it's surprisingly hard to find in the documentation. As presented above, it actually respects non-UTF-8 settings such as ISO 8859-*. It only defaults to UTF-8 if Python would have bogusly defaulted to ASCII, breaking the application.
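As a side note not in the original answer: on Python 3.7 and later only, the same default-to-UTF-8 idea can be expressed without replacing `sys.stdout`, by reconfiguring the existing stream in place. This is a sketch; the `hasattr` guard is there because `sys.stdout` may have been replaced by an object without `reconfigure`.

```python
import sys

# Python 3.7+ only: adjust the existing stream in place instead of
# wrapping it in a codecs writer. Only kicks in when Python would
# otherwise have defaulted to ASCII.
enc = sys.stdout.encoding
if hasattr(sys.stdout, 'reconfigure') and (
        enc is None or enc.upper() in ('ANSI_X3.4-1968', 'US-ASCII', 'ASCII')):
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print(u'\N{snowman}')
```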


Comments

2

I don't think you should try to solve this at the Python level. Document your application requirements, log the locale of the systems you run on so it can be included in bug reports, and leave it at that.

If you do want to go this route, at least distinguish between terminals and pipes. Never output data to a terminal that the terminal cannot explicitly handle: don't blindly output UTF-8, for example, because codepoints above U+007F could end up being interpreted as control codes when the terminal decodes the bytes with a different encoding.

For a pipe, output UTF-8 by default and make it configurable.

So you'd detect if a TTY is being used, then handle encoding based on that; for a terminal, set an error handler (pick one of replace or backslashreplace to provide replacement characters or escape sequences for whatever characters cannot be handled). For a pipe, use a configurable codec.

import codecs
import os
import sys

if os.isatty(sys.stdout.fileno()):
    output_encoding = sys.stdout.encoding or 'utf-8'
    errors = 'replace'
else:
    output_encoding = 'utf-8'  # allow override from settings
    errors = 'strict'  # perhaps parse from settings; strict is fine for UTF-8

# Wrap the binary stream: on Python 3 that is sys.stdout.buffer (the
# codecs writer emits bytes); on Python 2, sys.stdout itself accepts bytes.
stream = getattr(sys.stdout, 'buffer', sys.stdout)
sys.stdout = codecs.getwriter(output_encoding)(stream, errors=errors)

12 Comments

Setting environment variables contradicts the requirements in the question: "with any terminal settings and environment variables". The second option and why it does not work is already mentioned in the question.
@clark800: right. I think that that is not a good idea. I've given you your options anyway, but consider that the problem lies with the user, really.
Sometimes the output encoding is unavailable for no good reason, e.g. simply because the script is run by cron or in the "C" locale, or in a pipeline instead of on a TTY. When the output encoding is unavailable, defaulting to UTF-8 is the entirely reasonable thing to do, and is what modern systems default to anyway. It is certainly more reasonable that defaulting to ASCII and raising an exception at the unsuspecting user. "Documenting application requirements" doesn't help for a script designed to be run by actual end users as opposed to system administrators or programmers.
@user4815162342: the pipe scenario is covered here; cron also doesn't offer a TTY, so in both cases it'd output UTF-8 here.
You are right. I was answering primarily to the first paragraph of your answer, paying insufficient attention to the code.
0

You can encode the string yourself with the 'backslashreplace' error handler, so that unrepresentable characters are converted to escape sequences. In Python 2 you can print the result of encode directly, but in Python 3 you need to decode it back to Unicode first.

import sys
s = u'\xea'  # example string containing a non-ASCII character
encoding = sys.stdout.encoding
print(s.encode(encoding, 'backslashreplace').decode(encoding))

If sys.stdout.encoding doesn't deliver the value that your terminal can handle, that's a separate problem that you must deal with.
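A guarded variant of the same idea (the fallback is my addition, not part of the original answer): when `sys.stdout.encoding` is `None`, e.g. because output is piped, fall back to UTF-8 rather than crashing on `encode(None, ...)`.

```python
import sys

def show(s):
    # Fall back to UTF-8 when the stream reports no encoding (e.g. a pipe);
    # otherwise escape unrepresentable characters with backslashreplace.
    enc = sys.stdout.encoding or 'utf-8'
    print(s.encode(enc, 'backslashreplace').decode(enc))
```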

Comments

-1

You can handle the exception:

import sys

def always_print(s):
    try:
        print(s)
    except UnicodeEncodeError:
        # In Python 3, print(bytes) would show the repr (b'...'),
        # so write the UTF-8 bytes to the underlying buffer instead.
        data = s.encode('utf-8')
        getattr(sys.stdout, 'buffer', sys.stdout).write(data + b'\n')

3 Comments

What if the terminal encoding is something completely unrelated to ascii? Encoding as utf-8 would make it appear as gibberish.
Yes, it would. Which is why you try a regular print call first to use the terminal's encoding.
What I meant is if the terminal encoding is a non-unicode encoding unrelated to ascii, so it still fails to print directly and then displays gibberish due to the wrong encoding.
