9

I'm trying to find a generic solution to print unicode strings from a python script.

The requirements are that it must run in both python 2.7 and 3.x, on any platform, and with any terminal settings and environment variables (e.g. LANG=C or LANG=en_US.UTF-8).

The python print function automatically tries to encode to the terminal encoding when printing, but if the terminal encoding is ascii it fails.

For example, the following works when the environment variable LANG=en_US.UTF-8:

x = u'\xea'
print(x)

But it fails in python 2.7 when "LANG=C":

UnicodeEncodeError: 'ascii' codec can't encode character u'\xea' in position 0: ordinal not in range(128)

The following works regardless of the LANG setting, but would not properly show unicode characters if the terminal was using a different unicode encoding:

print(x.encode('utf-8'))

The desired behavior would be to always show unicode in the terminal if it is possible and show some encoding if the terminal does not support unicode. For example, the output would be UTF-8 encoded if the terminal only supported ascii. Basically, the goal is to do the same thing as the python print function when it works, but in the cases where the print function fails, use some default encoding.
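To make the desired behavior concrete, here is a rough sketch (the helper name `print_unicode` is hypothetical, not from any library): try the terminal's encoding first, and fall back to UTF-8 bytes when that fails or when no encoding is reported.

```python
import sys

def print_unicode(s):
    """Print s using the terminal's encoding when possible,
    falling back to UTF-8 bytes otherwise (hypothetical helper)."""
    enc = sys.stdout.encoding or 'utf-8'
    try:
        data = s.encode(enc)
    except UnicodeEncodeError:
        data = s.encode('utf-8')
    # getattr covers both Python 2 (no .buffer) and Python 3
    getattr(sys.stdout, 'buffer', sys.stdout).write(data + b'\n')
```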

4 Comments
  • Python 3.0 provides an alternative string type for binary data and supports Unicode text in its normal string type (ASCII is treated as a simple subset of Unicode). Python 2.6 provides an alternative string type for non-ASCII Unicode text and supports both simple text and binary data in its normal string type. So now, what's your question? Commented Dec 7, 2014 at 21:06
  • Any terminal settings and environment variables? Including incorrect ones? :^) Commented Dec 7, 2014 at 22:35
  • Yes, because I get bug reports even when the user's environment is the main problem, so I'd like to try to make the code as robust as possible. Commented Dec 7, 2014 at 22:55
  • I believe the OP's requirements are that incorrect terminal settings (such as the C locale) should present the user with a reasonable default, such as UTF-8, not with a UnicodeEncodeError and a traceback. The former can produce garbage at worst (but will do what the user wants on all modern systems), whereas the latter is bound to frustrate the user. Commented Dec 8, 2014 at 20:52

4 Answers

13

You can handle the LANG=C case by telling sys.stdout to default to UTF-8 in cases when it would otherwise default to ASCII.

import sys, codecs

if sys.stdout.encoding is None or sys.stdout.encoding == 'ANSI_X3.4-1968':
    utf8_writer = codecs.getwriter('UTF-8')
    if sys.version_info.major < 3:
        sys.stdout = utf8_writer(sys.stdout, errors='replace')
    else:
        sys.stdout = utf8_writer(sys.stdout.buffer, errors='replace')

print(u'\N{snowman}')

The above snippet fulfills your requirements: it works in Python 2.7 and 3.4, and it doesn't break when LANG is in a non-UTF-8 setting such as C.

It is not a new technique, but it's surprisingly hard to find in the documentation. As presented above, it actually respects non-UTF-8 settings such as ISO 8859-*. It only defaults to UTF-8 if Python would have bogusly defaulted to ASCII, breaking the application.
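As a side note not in the original answer: on Python 3.7 and later only, the same default-to-UTF-8 idea can be expressed without replacing `sys.stdout`, by reconfiguring the existing stream in place. This is a sketch; the `hasattr` guard is there because `sys.stdout` may have been replaced by an object without `reconfigure`.

```python
import sys

# Python 3.7+ only: adjust the existing stream in place instead of
# wrapping it in a codecs writer. Only kicks in when Python would
# otherwise have defaulted to ASCII.
enc = sys.stdout.encoding
if hasattr(sys.stdout, 'reconfigure') and (
        enc is None or enc.upper() in ('ANSI_X3.4-1968', 'US-ASCII', 'ASCII')):
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')

print(u'\N{snowman}')
```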


Comments

2

I don't think you should try to solve this at the Python level. Document your application requirements, log the locale of the systems you run on so it can be included in bug reports, and leave it at that.

If you do want to go this route, at least distinguish between terminals and pipes. Never output data to a terminal that the terminal cannot explicitly handle: don't blindly output UTF-8, for example, because codepoints above U+007F could end up being interpreted as control codes when the terminal decodes the bytes with a different encoding.

For a pipe, output UTF-8 by default and make it configurable.

So you'd detect if a TTY is being used, then handle encoding based on that; for a terminal, set an error handler (pick one of replace or backslashreplace to provide replacement characters or escape sequences for whatever characters cannot be handled). For a pipe, use a configurable codec.

import codecs
import os
import sys

if os.isatty(sys.stdout.fileno()):
    output_encoding = sys.stdout.encoding or 'utf-8'
    errors = 'replace'
else:
    output_encoding = 'utf-8'  # allow override from settings
    errors = 'strict'  # perhaps parse from settings; strict is fine for UTF-8

# Wrap the binary stream: on Python 3 that is sys.stdout.buffer (the
# codecs writer emits bytes); on Python 2, sys.stdout itself accepts bytes.
stream = getattr(sys.stdout, 'buffer', sys.stdout)
sys.stdout = codecs.getwriter(output_encoding)(stream, errors=errors)

12 Comments

Setting environment variables contradicts the requirements in the question: "with any terminal settings and environment variables". The second option and why it does not work is already mentioned in the question.
@clark800: right. I think that that is not a good idea. I've given you your options anyway, but consider that the problem lies with the user, really.
Sometimes the output encoding is unavailable for no good reason, e.g. simply because the script is run by cron or in the "C" locale, or in a pipeline instead of on a TTY. When the output encoding is unavailable, defaulting to UTF-8 is the entirely reasonable thing to do, and is what modern systems default to anyway. It is certainly more reasonable that defaulting to ASCII and raising an exception at the unsuspecting user. "Documenting application requirements" doesn't help for a script designed to be run by actual end users as opposed to system administrators or programmers.
@user4815162342: the pipe scenario is covered here; cron also doesn't offer a TTY, so in both cases it'd output UTF-8 here.
You are right. I was answering primarily to the first paragraph of your answer, paying insufficient attention to the code.
0

You can encode the string yourself with the 'backslashreplace' error handler, so that unrepresentable characters are converted to escape sequences. In Python 2 you can print the result of encode directly, but in Python 3 you need to decode it back to Unicode first.

import sys
s = u'\xea'  # example string containing a non-ASCII character
encoding = sys.stdout.encoding
print(s.encode(encoding, 'backslashreplace').decode(encoding))

If sys.stdout.encoding doesn't deliver the value that your terminal can handle, that's a separate problem that you must deal with.
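A guarded variant of the same idea (the fallback is my addition, not part of the original answer): when `sys.stdout.encoding` is `None`, e.g. because output is piped, fall back to UTF-8 rather than crashing on `encode(None, ...)`.

```python
import sys

def show(s):
    # Fall back to UTF-8 when the stream reports no encoding (e.g. a pipe);
    # otherwise escape unrepresentable characters with backslashreplace.
    enc = sys.stdout.encoding or 'utf-8'
    print(s.encode(enc, 'backslashreplace').decode(enc))
```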

Comments

-1

You can handle the exception:

import sys

def always_print(s):
    try:
        print(s)
    except UnicodeEncodeError:
        # In Python 3, print(bytes) would show the repr (b'...'),
        # so write the UTF-8 bytes to the underlying buffer instead.
        data = s.encode('utf-8')
        getattr(sys.stdout, 'buffer', sys.stdout).write(data + b'\n')

3 Comments

What if the terminal encoding is something completely unrelated to ascii? Encoding as utf-8 would make it appear as gibberish.
Yes, it would. Which is why you try a regular print call first to use the terminal's encoding.
What I meant is if the terminal encoding is a non-unicode encoding unrelated to ascii, so it still fails to print directly and then displays gibberish due to the wrong encoding.
