I have the following code snippet:

#!/usr/bin/env python3

print(float(b'5'))

This prints 5.0 with no error (on Linux with UTF-8 encoding). I'm very surprised that it doesn't raise an error, since Python is not supposed to know what encoding is used for the bytes object.

Any insight?

  • Have you read the documentation? Also see docs.python.org/3.6/c-api/buffer.html#bufferobjects Commented May 18, 2018 at 10:07
  • @Kasramvd: the documentation for float() states it accepts a str, a number, or a type that implements __float__. bytes doesn't implement __float__. Commented May 18, 2018 at 10:13
  • @MartijnPieters Here it's mentioned that "If the argument is a string, it should contain a decimal number, optionally preceded by a sign, and optionally embedded in whitespace." Doesn't b'5' follow that rule? Although it should have been specified clearly in the documentation. Commented May 18, 2018 at 10:17
  • Fair question, since not all encodings are supersets of ASCII. Commented May 18, 2018 at 10:17
  • @Kasramvd: no, it doesn't. The bytes type is not considered a string. Commented May 18, 2018 at 10:24

1 Answer

When passed a bytes object, float() treats the contents of the object as ASCII-encoded text. That's sufficient here, because the parser that converts text to a float only works with ASCII digits and letters, plus the characters ., _, + and -; the only non-ASCII codepoints permitted in str input are whitespace and decimal digits, which are mapped to their ASCII equivalents before parsing. This is analogous to the way int() treats bytes input.
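A quick way to see this behaviour is to feed float() a few bytes values. This is a minimal sketch, assuming CPython 3.6 or newer (for underscore support); parse_ascii_bytes is a hypothetical helper name used only for illustration, not part of any library.

```python
def parse_ascii_bytes(data):
    """Return float(data), or None if float() rejects the bytes.

    Hypothetical helper, only here to make the outcomes easy to compare.
    """
    try:
        return float(data)
    except ValueError:
        return None


# ASCII digits, sign, underscore, exponent and surrounding ASCII whitespace
# are all handled exactly as they would be for a str argument.
print(parse_ascii_bytes(b'5'))
print(parse_ascii_bytes(b' -1_000.5 '))
print(parse_ascii_bytes(b'1e3'))

# int() treats bytes the same way.
print(int(b'42'))

# Bytes outside the ASCII range are rejected: this is the UTF-8 encoding
# of ARABIC-INDIC DIGIT FIVE (U+0665).
print(parse_ascii_bytes('\u0665'.encode('utf-8')))
```

The last call returns None because the bytes are never decoded as UTF-8; they are only ever interpreted as ASCII.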

Under the hood, the implementation does this:

  • because the input is not a string, PyNumber_Float() is called on the object (for str objects the code jumps straight to PyFloat_FromString).
  • PyNumber_Float() checks for a __float__ method, but if that's not available, it calls PyFloat_FromString()
  • PyFloat_FromString() accepts not only str objects, but any object implementing the buffer protocol. The String in the name is a Python 2 holdover; the Python 3 str type is called Unicode in the C implementation.
  • bytes objects implement the buffer protocol, and the PyBytes_AS_STRING macro is used to access the internal C buffer holding the bytes.
  • A combination of two internal functions named _Py_string_to_number_with_underscores() and float_from_string_inner() is then used to parse ASCII bytes into a floating point value.
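The steps above can be modelled roughly in pure Python. This is only a simplified sketch of the dispatch order, not the actual C code: float_dispatch_sketch and Half are hypothetical names, memoryview() stands in for the C-level buffer protocol check, and the real PyNumber_Float() has additional branches that are left out here.

```python
def float_dispatch_sketch(obj):
    """Rough model of the order in which float() inspects its argument.

    Returns a (path, value) tuple so the chosen branch is visible.
    """
    # str input goes straight to the string parser.
    if isinstance(obj, str):
        return 'str', float(obj)

    # Otherwise a __float__ method on the type is preferred.
    if hasattr(type(obj), '__float__'):
        return '__float__', type(obj).__float__(obj)

    # Finally, anything exposing the buffer protocol is parsed as ASCII text.
    # memoryview() raises TypeError for objects without buffer support.
    raw = bytes(memoryview(obj))
    return 'buffer', float(raw)


class Half:
    """Hypothetical type implementing __float__."""

    def __float__(self):
        return 0.5


print(float_dispatch_sketch('2.5'))
print(float_dispatch_sketch(Half()))
print(float_dispatch_sketch(b'1.5'))
print(float_dispatch_sketch(bytearray(b'1.5')))
```

Note that bytearray takes the same path as bytes, since both expose their contents through the buffer protocol.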

For actual str values, the CPython implementation first transforms the text into a sequence of ASCII bytes: non-ASCII whitespace characters become ASCII 0x20 spaces and non-ASCII decimal digits become the corresponding ASCII digits. The result is then handed to the same _Py_string_to_number_with_underscores() / float_from_string_inner() combo.
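The difference between the str path and the bytes path shows up with non-ASCII whitespace and digits. The following is a small illustration, assuming CPython 3; the accepts helper is a hypothetical name introduced for this example.

```python
def accepts(value):
    """Return True if float() can parse the value, False on ValueError."""
    try:
        float(value)
    except ValueError:
        return False
    return True


# NO-BREAK SPACE + "1.5" + EM SPACE: valid as str, since the non-ASCII
# whitespace is mapped to plain spaces before parsing.
padded = '\u00a01.5\u2003'
print(float(padded))

# ARABIC-INDIC DIGITS ONE, TWO, THREE are mapped to ASCII "123" for str input.
arabic_digits = '\u0661\u0662\u0663'
print(float(arabic_digits))

# The same text encoded to UTF-8 bytes skips that mapping and is rejected.
print(accepts(padded.encode('utf-8')))
print(accepts(arabic_digits.encode('utf-8')))
```

So the ASCII-only restriction applies to bytes input as-is, while str input gets a normalisation pass first.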

I see this as a bug in the documentation and have filed an issue with the Python project to have it updated.

3 Comments

I know there won't be a thing about python that this guy doesn't know.
Thanks for the great answer. So, just to be clear, this will fail with certain encodings, such as UTF-16?
@static_rtti: absolutely, because the \x00 bytes won't be accepted. The bytes must be ASCII only, and fit the float() string interpretation rules.
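To make the UTF-16 case concrete, here is a short sketch, assuming CPython 3. The helper name float_from_encoded is hypothetical; it simply decodes the bytes explicitly before calling float().

```python
def float_from_encoded(data, encoding):
    """Decode bytes with a known encoding, then parse the text as a float."""
    return float(data.decode(encoding))


utf16 = '5'.encode('utf-16-le')   # b'5\x00', note the embedded NUL byte
print(utf16)

# Passing the raw UTF-16 bytes to float() fails, because the \x00 byte
# is not part of a valid ASCII float literal.
try:
    float(utf16)
except ValueError as exc:
    print('rejected:', exc)

# Decoding explicitly first works for any encoding.
print(float_from_encoded(utf16, 'utf-16-le'))
```

Decoding explicitly is the safer choice whenever the encoding of the bytes is known and is not guaranteed to be ASCII-compatible.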
