I want to read a binary PNM image file from stdin. The file contains a header which is encoded as ASCII text, and a payload which is binary. As a simplified example of reading the header, I have created the following snippet:

#! /usr/bin/env python3
import sys
header = sys.stdin.readline()
print("header=["+header.strip()+"]")

I run it as "test.py" (from a Bash shell), and it works fine in this case:

$ printf "P5 1 1 255\n\x41" |./test.py 
header=[P5 1 1 255]

However, a small change in the binary payload breaks it:

$ printf "P5 1 1 255\n\x81" |./test.py 
Traceback (most recent call last):
  File "./test.py", line 3, in <module>
    header = sys.stdin.readline()
  File "/usr/lib/python3.4/codecs.py", line 313, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x81 in position 11: invalid start byte

Is there an easy way to make this work in Python 3?

  • Did you try changing the input encoding? stackoverflow.com/a/16549381/4954037 Commented Jul 18, 2015 at 11:03
  • @hiroprotagonist: Thanks for the hint. The approach described there led me to one possible solution, although applying Unicode decoding to arbitrary binary data is a bit of a hack. Commented Jul 19, 2015 at 1:42

2 Answers

To read binary data, use a binary stream, e.g., by calling the TextIOBase.detach() method:

#!/usr/bin/env python3
import sys

sys.stdin = sys.stdin.detach() # convert to binary stream
header = sys.stdin.readline().decode('ascii') # b'\n'-terminated
print(header, end='')
print(repr(sys.stdin.read()))

From the docs, it is possible to read binary data (as type bytes) from stdin with sys.stdin.buffer.read():

To write or read binary data from/to the standard streams, use the underlying binary buffer object. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc').

So one option is to read the data in binary mode. readline() and the other read methods still work, but they return bytes. Once you have captured the ASCII header, convert it to text with decode('ASCII') for any text-specific processing.
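A minimal sketch of the binary-mode approach, assuming a single-line header as in the simplified example from the question (real PNM headers may span several lines and contain comments). The helper name read_pnm is hypothetical, and io.BytesIO stands in for sys.stdin.buffer so the snippet runs without a pipe:

```python
import io

def read_pnm(stream):
    """Split a binary stream into an ASCII header and a bytes payload.

    Assumes the whole header sits on the first newline-terminated line.
    """
    header = stream.readline().decode('ascii').strip()
    payload = stream.read()  # remaining bytes, left undecoded
    return header, payload

# io.BytesIO stands in for sys.stdin.buffer here.
header, payload = read_pnm(io.BytesIO(b"P5 1 1 255\n\x81"))
print("header=[" + header + "]")
print(list(payload))
```

Only the header is ever decoded, so the payload can hold any byte value without raising UnicodeDecodeError.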

Alternatively, you can wrap the input stream in io.TextIOWrapper() and specify the latin-1 character set. The implicit decode operation is then essentially a pass-through: the data is of type str (which represents text), but each input byte maps one-to-one to the code point with the same value (although a character may occupy more than one byte of storage per input byte).
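A short sketch of the latin-1 wrapper, again with io.BytesIO standing in for sys.stdin.buffer. One caveat worth knowing: by default TextIOWrapper translates '\r' and '\r\n' to '\n' on input, which would alter a 0x0D payload byte. Passing newline='\n' is one way to turn that translation off; the three-pixel payload below is made up for illustration:

```python
import io

# Stand-in for sys.stdin.buffer; the payload includes a 0x0D byte on purpose.
raw = io.BytesIO(b"P5 1 3 255\n\x0d\x81\xff")

# newline='\n' disables universal-newline translation, so the 0x0D
# payload byte is not turned into '\n'.
text = io.TextIOWrapper(raw, encoding='latin-1', newline='\n')

header = text.readline().strip()
payload = text.read()                  # str, one character per input byte
values = [ord(ch) for ch in payload]   # recover the original byte values
print(header)
print(values)
```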

Here's code that works in either mode:

#! /usr/bin/python3

import io
import sys

BINARY = True  # either way works

if BINARY:
    istream = sys.stdin.buffer  # yields bytes
else:
    # newline='\n' turns off universal-newline translation, which would
    # otherwise convert a 0x0D payload byte into '\n'
    istream = io.TextIOWrapper(sys.stdin.buffer, encoding='latin-1',
                               newline='\n')

header = istream.readline()
if BINARY:
    header = header.decode('ASCII')
print("header=[" + header.strip() + "]")

payload = istream.read()
print("len=" + str(len(payload)))
for i in payload:
    print(i if BINARY else ord(i))

Test every possible 1-pixel payload with the following Bash command:

for i in $(seq 0 255) ; do printf "P5 1 1 255\n\x$(printf %02x $i)" |./test.py ; done

1 Comment

The hack of using latin-1 as a conduit for binary data works because latin-1 is 8-bit clean: every byte value from 0 to 255 decodes to a character. UTF-8 is not, because it rejects some byte sequences.
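A quick check of that claim, as an illustrative sketch using only the built-in codecs (the variable names are arbitrary):

```python
all_bytes = bytes(range(256))

# latin-1: byte value n decodes to code point n, and encodes back unchanged.
as_text = all_bytes.decode('latin-1')
assert [ord(ch) for ch in as_text] == list(range(256))
assert as_text.encode('latin-1') == all_bytes

# UTF-8: 0x81 is a continuation byte and cannot start a character.
try:
    b"\x81".decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)
```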
