
I want to write a non-ASCII character, let's say →, to standard output. The tricky part seems to be that some of the data I want to concatenate to that string is read from JSON. Consider the following simple JSON document:

{"foo":"bar"}

I include this because if I just want to print → on its own, it seems enough to simply write:

print("→")

and it will do the right thing in python2 and python3.

So I want to print the value of foo together with my non-ASCII character →. The only way I found to do this that works in both Python 2 and Python 3 is:

getattr(sys.stdout, 'buffer', sys.stdout).write(data["foo"].encode("utf8")+u"→".encode("utf8"))

or

getattr(sys.stdout, 'buffer', sys.stdout).write((data["foo"]+u"→").encode("utf8"))

It is important not to omit the u in front of "→", because otherwise Python 2 throws a UnicodeDecodeError.
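For reference, here is a self-contained sketch of the approach above. The helper name write_utf8 is made up for illustration, and u"\u2192" is just the escaped form of "→", used so the snippet does not depend on a source-encoding declaration. It assumes Python 2.7 or Python 3.3+.

```python
import io
import json
import sys


def write_utf8(text, stream=None):
    """Write `text` to `stream` (default: sys.stdout) as UTF-8 bytes."""
    if stream is None:
        stream = sys.stdout
    # Python 3 text streams expose their byte stream as .buffer;
    # Python 2's sys.stdout has no .buffer and accepts byte strings itself.
    target = getattr(stream, "buffer", stream)
    target.write(text.encode("utf-8"))
    target.flush()


# json.loads returns text (unicode in Python 2, str in Python 3),
# so concatenating with a u"" literal never triggers an implicit decode.
data = json.loads('{"foo": "bar"}')

if __name__ == "__main__":
    write_utf8(data["foo"] + u"\u2192\n")
```

The bytes written are the same no matter what sys.stdout.encoding says; whether the terminal displays them correctly is a separate matter.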

Using the print function like this:

print((data["foo"]+u"→").encode("utf8"), file=(getattr(sys.stdout, 'buffer', sys.stdout)))

doesn't seem to work, because Python 3 complains: TypeError: 'str' does not support the buffer interface.

Did I find the best way or is there a better option? Can I make the print function work?

  • So print(data['foo'] + u'→') doesn't work? Commented May 30, 2014 at 0:12
  • @user2357112: Not on my machine. Commented May 30, 2014 at 0:28
  • For your last example that calls print, in Python 3 encoding the string returns bytes. Since print requires a string, it calls the __str__ method, which for bytes just returns a repr, i.e. str("→".encode()) == "b'\\xe2\\x86\\x92'". Next print writes this useless repr to the file, but the BufferedWriter requires an object that supports the buffer interface, such as bytes. Commented May 30, 2014 at 2:13
  • @eryksun thank you! As print() is able to print all kinds of datatypes without explicit conversion to str, I didn't think it would choke on bytes. Commented May 30, 2014 at 6:37
  • Printing has to first get an object as a string. This doesn't choke on Python 3 bytes. Decoding bytes using a default encoding would be wrong in general, since a bytes object isn't necessarily text. I just meant the repr string is "useless" for your needs. What choked is trying to print to a BufferedWriter, e.g. print('abc', file=sys.stdout.buffer). Commented May 30, 2014 at 7:16
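The two points made in the comments above can be checked in isolation. This is a small Python 3 sketch; an in-memory BufferedWriter stands in for sys.stdout.buffer (that substitution is an assumption made so the snippet leaves real stdout alone).

```python
import io

arrow_bytes = u"\u2192".encode("utf-8")

# In Python 3, str() of a bytes object is its repr, not decoded text.
as_text = str(arrow_bytes)

# print() always hands str to file.write(), and a binary stream
# only accepts bytes-like objects, so the call fails with TypeError.
binary_stream = io.BufferedWriter(io.BytesIO())
try:
    print("abc", file=binary_stream)
    error = None
except TypeError as exc:
    error = exc
```

So the failure comes from pairing print() with a binary stream, not from the bytes object itself.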

2 Answers


The most concise version I could come up with is the following, which you may be able to shorten further with a few convenience functions (or even by replacing/overriding the print function):

# -*- coding=utf-8 -*-
import codecs
import os
import sys

# if you include the -*- coding line, you can use this
output = 'bar' + u'→'
# otherwise, use this
output = 'bar' + b'\xe2\x86\x92'.decode('utf-8')

if sys.stdout.encoding == 'UTF-8':
    print(output)
else:
    output += os.linesep
    if sys.version_info[0] >= 3:
        # encode() already returns bytes, so no extra bytes() wrapper is needed
        sys.stdout.buffer.write(output.encode('utf-8'))
    else:
        codecs.getwriter('utf-8')(sys.stdout).write(output)

The best option is to use the -*- coding line, which allows you to use the actual character in the file. But if for some reason you can't use the coding line, it's still possible to accomplish this without it.

This (both with and without the encoding line) works on Linux (Arch) with python 2.7.7 and 3.4.1. It also works if the terminal's encoding is not UTF-8. (On Arch Linux, I just change the encoding by using a different LANG environment variable.)

LANG=zh_CN python test.py

It also sort of works on Windows, which I tried with 2.6, 2.7, 3.3, and 3.4. By sort of, I mean I could get the '→' character to display only on a mintty terminal. On a cmd terminal, that character would display as 'ΓåÆ'. (There may be something simple I'm missing there.)
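One possible explanation for the 'ΓåÆ' output, assuming the cmd console was using code page 437 (the answer doesn't say which code page was active): the three UTF-8 bytes of '→' are each shown as a separate cp437 character. A quick check:

```python
# UTF-8 encoding of U+2192 (RIGHTWARDS ARROW) is three bytes.
utf8_bytes = u"\u2192".encode("utf-8")

# Decoding those same bytes as code page 437 gives three unrelated characters.
misread = utf8_bytes.decode("cp437")
```

Here misread is the three-character string 'ΓåÆ', matching what the cmd terminal displayed.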


8 Comments

With regard to Windows only sort of working, would changing 'utf-8' to sys.stdout.encoding print any better?
No. That would be the same as simply doing a print. If you're not changing the encoding, sys.stdout.encoding is the one it uses, which is why all the work is needed to change it from its default.
As an experiment, try the code here. It will show the effect of the encoding used on a terminal for all available encodings-- for ones that don't throw exceptions. I ran this on Windows & Linux, 2.7 & 3.4.
I cannot stress enough how important it is to ensure your terminal or console is correctly configured. It should not be Python's job to ensure this. Personally, I'd use output = output.encode('utf-8'), try:, sys.stdout.buffer.write(output), except AttributeError:, sys.stdout.write(output); codecs.getwriter() is overkill here, and you need to test for features, not versions. You can use the io module in Python 2 as well so sys.stdout could actually have the .buffer attribute there too.
@MartijnPieters Is there a tutorial or reference on how to correctly configure a console/terminal (cmd/powershell/other?) on Windows?

If you don't need to print to sys.stdout.buffer, then the following should print fine to sys.stdout. I tried it in both Python 2.7 and 3.4, and it seemed to work fine:

# -*- coding=utf-8 -*-
print("bar" + u"→")

3 Comments

This does not work if sys.stdout.encoding != "UTF-8", such as on Windows.
@snapshoe It is obvious that it will not be displayed properly if the output goes to something with limited capabilities. But Python does write to the output in UTF-8, and the OP wanted to send the output in a file, it seems.
@rds I don't see any mention of outputting to a file. I do see printing to stdout mentioned everywhere, including the title of the post.
