Let's say I have some 32-bit and 64-bit floating point values:

>>> import numpy as np
>>> v32 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635], 
                   dtype=np.float32)
>>> v64 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635], 
                   dtype=np.float64)

I want to serialize these values to text without losing precision (or at least really close to not losing precision). I think the canonical way of doing this is with repr:

>>> map(repr, v32)
['5.0', '0.1', '2.4000001', '4.5555553', '12345679.0']
>>> map(repr, v64)
['5.0', '0.10000000000000001', '2.3999999999999999', '4.5555555555555554', 
 '12345678.923456786']

But I want to make the representation as compact as possible to minimize file size, so it would be nice if values like 2.4 got serialized without the extra decimals. Yes, I know that's their actual floating point representation, but %g seems to be able to take care of this:

>>> ('%.7g ' * len(v32)) % tuple(v32)
'5 0.1 2.4 4.555555 1.234568e+07 '
>>> ('%.16g ' * len(v64)) % tuple(v64)
'5 0.1 2.4 4.555555555555555 12345678.92345679 '

My question is: is it safe to use %g in this way? Are .7 and .16 the correct values so that precision won't be lost?

3 Answers

Python 2.7 and later already have a smart repr implementation for floats that prints 0.1 as 0.1. The brief output is chosen in preference to other candidates such as 0.10000000000000001 because it is the shortest representation of that particular number that roundtrips to the exact same floating-point value when read back into Python. To use this algorithm, convert your 64-bit floats to actual Python floats before handing them off to repr:

>>> map(repr, map(float, v64))
['5.0', '0.1', '2.4', '4.555555555555555', '12345678.923456786']

Surprisingly, the result is natural-looking and numerically correct. More info on the 2.7/3.2 repr can be found in What's New and a fascinating lecture by Mark Dickinson.
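
You can check the roundtrip claim directly on the sample data (a quick sanity check, not a proof; the 2.7+ repr guarantees it by construction):

>>> all(float(repr(float(x))) == x for x in v64)
True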

Unfortunately, this trick won't work for 32-bit floats, at least not without reimplementing the algorithm used by Python 2.7's repr.
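
You can see why: widening a float32 to a double and taking repr yields the shortest decimal for the double, which carries digits the float32 never had:

>>> repr(float(v32[2]))
'2.4000000953674316'

(As an aside for later readers: NumPy 1.14 and newer ship their own shortest-roundtrip formatter that does understand float32, so on a modern NumPy this is solved directly. A one-liner, assuming such a NumPy is available:)

>>> np.format_float_positional(v32[2], unique=True)
'2.4'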

8 Comments

and wouldn't this also lose 64-bit precision with a 32-bit interpreter?
actually, never mind, I think floats are always upcast to doubles, even on a 32-bit Python
Just compared repr(0.1) in 2.6 to 2.7 and it's really nice. Unfortunately, I do want a shorter repr for 32-bit values, so won't work for me :(
What Python calls float is actually a C double, on all architectures.
If you really need a short repr for 32-bit floats, and if speed is not an issue, you could try implementing a simplified version of the shortest-representation algorithm. It can't be too hard if you don't need to be fast and find the shortest representation for 100% of the cases. (Where you don't, you can always punt and delegate the formatting to repr.)

To uniquely determine a single-precision (32-bit) floating point number in IEEE-754 format, as many as 9 significant decimal digits (leading zeros don't count) can be necessary, and 9 digits are always sufficient.

For double-precision (64-bit) floating point numbers, 17 (significant) decimal digits may be necessary and are always sufficient.

I'm not quite sure how the %g format is specified, but by the looks of it the precision counts significant digits only (the leading 0 in 0.1 doesn't count), so the safe values for the precision would be .9 and .17.
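
You can verify both claims on the question's sample data: 7 digits lose information for float32, while 9 (and 17 for doubles) roundtrip exactly. A quick check:

>>> np.float32('%.7g' % v32[3]) == v32[3]  # 7 digits are not enough
False
>>> s32 = ' '.join('%.9g' % x for x in v32)
>>> np.array_equal(np.array(s32.split(), dtype=np.float32), v32)
True
>>> s64 = ' '.join('%.17g' % x for x in v64)
>>> np.array_equal(np.array(s64.split(), dtype=np.float64), v64)
True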

If you want to minimise the file size, writing the byte representations would produce a much smaller file, so if you can do that, that's the way to go.
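
For illustration, the binary route is only a couple of lines with NumPy (a sketch; tobytes is spelled tostring in older NumPy versions):

>>> raw = v32.tobytes()  # 4 bytes per float32, 20 bytes for the whole array
>>> np.array_equal(np.frombuffer(raw, dtype=np.float32), v32)
True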

10 Comments

Can't use binary unfortunately. Using a file specification that has numerical data as text (COLLADA)
It seems like .9 just does what repr does, so I guess I have to suffer with values like 0.100000001 :(
Depending on how much work you're willing to invest, you could write a function that produces the shortest representation that parses back to the original value. If you're interested, there's a paper by Burger and Dybvig ("Printing Floating-Point Numbers Quickly and Accurately").
I'd definitely be interested, although I need this to be fast too, so if I have to call a Python function for each value, it might be too slow for my needs
I don't know how fast repr is, but it's probably implemented in C and faster. The Burger/Dybvig algorithm isn't too fast; it's much quicker to produce a representation if you know in advance how many significant digits you want.

The C code that implements the fancy repr in 2.7 is mostly in Python/dtoa.c (with wrappers in Python/pystrtod.c and Objects/floatobject.c). In particular, look at _Py_dg_dtoa. It should be possible to borrow this code and modify it to work with float instead of double. Then you could wrap it up in an extension module, or just build it as a shared library and call it through ctypes.

Also, note that the source says the implementation is "Inspired by 'How to Print Floating-Point Numbers Accurately' by Guy L. Steele, Jr. and Jon L. White [Proc. ACM SIGPLAN '90, pp. 112-126]". So, you might be able to implement something less flexible and simpler yourself by reading that paper (and whichever of the modifications documented in the dtoa.c comments seem appropriate).
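
For reference, the "less flexible and simpler" version can be tiny in pure Python if speed doesn't matter: keep asking %g for more significant digits until the result parses back to the same float32. A sketch (the function name is made up; per the other answer, 9 digits always suffice):

>>> def shortest_float32(x):
...     x = np.float32(x)
...     for p in range(1, 10):       # 9 significant digits always suffice
...         s = '%.*g' % (p, x)
...         if np.float32(s) == x:   # first precision that roundtrips wins
...             return s
...     return repr(float(x))        # only reached for NaN, which never compares equal
>>> [shortest_float32(x) for x in v32]
['5', '0.1', '2.4', '4.5555553', '12345679']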

Finally, the code is a minor change to code posted by David Gay at AT&T, and used in a number of other libraries (NSPR, etc.), one of which might have a more accessible version.

But before doing any of that, make sure there really is a performance issue by trying a Python function and measuring whether it's too slow.

And if this really is a performance-critical area, you probably don't want to loop over the list and call repr (or your own fancy C function) in the first place; you probably want a function that converts a numpy array of floats or doubles to a string all at once. (Ideally you'd want to build that into numpy, of course.)
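
For example, NumPy's own text writer already accepts a C-style format and keeps the per-value loop out of Python. A sketch using the .9g precision from the other answer:

>>> np.savetxt('out.txt', v32.reshape(1, -1), fmt='%.9g')
>>> open('out.txt').read()
'5 0.100000001 2.4000001 4.55555534 12345679\n'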

One last thought: you're looking for "at least really close to not losing precision". It's conceivable that just converting to double and using the repr is close enough for your purposes, and it's obviously much easier than anything else, so you should at least test it to rule it out.

Needless to say, you should also test whether %.9g and %.17g are close enough for your purposes, since that's the next easiest thing that could possibly work.
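
Measuring is cheap, for instance (the array size here is arbitrary):

>>> import timeit
>>> big = np.random.rand(100000).astype(np.float32)
>>> timeit.timeit(lambda: ' '.join('%.9g' % x for x in big), number=10)
>>> timeit.timeit(lambda: ' '.join(map(repr, map(float, big))), number=10)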
