Let's say I have some 32-bit and 64-bit floating point values:

>>> import numpy as np
>>> v32 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635], 
                   dtype=np.float32)
>>> v64 = np.array([5, 0.1, 2.4, 4.555555555555555, 12345678.92345678635], 
                   dtype=np.float64)

I want to serialize these values to text without losing precision (or at least really close to not losing precision). I think the canonical way of doing this is with repr:

>>> map(repr, v32)
['5.0', '0.1', '2.4000001', '4.5555553', '12345679.0']
>>> map(repr, v64)
['5.0', '0.10000000000000001', '2.3999999999999999', '4.5555555555555554', 
 '12345678.923456786']

But I want to make the representation as compact as possible to minimize file size, so it would be nice if values like 2.4 got serialized without the extra decimals. Yes, I know that's their actual floating point representation, but %g seems to be able to take care of this:

>>> ('%.7g ' * len(v32)) % tuple(v32)
'5 0.1 2.4 4.555555 1.234568e+07 '
>>> ('%.16g ' * len(v64)) % tuple(v64)
'5 0.1 2.4 4.555555555555555 12345678.92345679 '

My question is: is it safe to use %g in this way? Are .7 and .16 the correct values so that precision won't be lost?

3 Answers

Python 2.7 and later already have a smart repr implementation for floats that prints 0.1 as 0.1. The brief output is chosen in preference to other candidates such as 0.10000000000000001 because it is the shortest representation of that particular number that roundtrips to the exact same floating-point value when read back into Python. To use this algorithm, convert your 64-bit floats to actual Python floats before handing them off to repr:

>>> map(repr, map(float, v64))
['5.0', '0.1', '2.4', '4.555555555555555', '12345678.923456786']

Surprisingly, the result is natural-looking and numerically correct. More info on the 2.7/3.2 repr can be found in What's New and a fascinating lecture by Mark Dickinson.
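
You can check the roundtrip claim directly on the sample data (a quick sanity check, not a proof; the 2.7+ repr guarantees it by construction):

>>> all(float(repr(float(x))) == x for x in v64)
True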

Unfortunately, this trick won't work for 32-bit floats, at least not without reimplementing the algorithm used by Python 2.7's repr.
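
You can see why: widening a float32 to a double and taking repr yields the shortest decimal for the double, which carries digits the float32 never had:

>>> repr(float(v32[2]))
'2.4000000953674316'

(As an aside for later readers: NumPy 1.14 and newer ship their own shortest-roundtrip formatter that does understand float32, so on a modern NumPy this is solved directly. A one-liner, assuming such a NumPy is available:)

>>> np.format_float_positional(v32[2], unique=True)
'2.4'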

8 Comments

and wouldn't this also lose 64-bit precision with a 32-bit interpreter?
actually, never mind, I think floats are always upcast to doubles, even on a 32-bit Python
Just compared repr(0.1) in 2.6 to 2.7 and it's really nice. Unfortunately, I do want a shorter repr for 32-bit values, so won't work for me :(
What Python calls float is actually a C double, on all architectures.
If you really need a short repr for 32-bit floats, and if speed is not an issue, you could try implementing a simplified version of the shortest-representation algorithm. It can't be too hard if you don't need to be fast and find the shortest representation for 100% of the cases. (Where you don't, you can always punt and delegate the formatting to repr.)

To uniquely determine a single-precision (32-bit) floating point number in IEEE-754 format, as many as 9 significant decimal digits (leading zeros don't count) can be necessary, and 9 digits are always sufficient.

For double-precision (64-bit) floating point numbers, 17 (significant) decimal digits may be necessary and are always sufficient.

I'm not quite sure how the %g format is specified, but by the looks of it the precision counts significant digits only (the leading 0 in 0.1 doesn't count), so the safe values for the precision would be .9 and .17.
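
You can verify both claims on the question's sample data: 7 digits lose information for float32, while 9 (and 17 for doubles) roundtrip exactly. A quick check:

>>> np.float32('%.7g' % v32[3]) == v32[3]  # 7 digits are not enough
False
>>> s32 = ' '.join('%.9g' % x for x in v32)
>>> np.array_equal(np.array(s32.split(), dtype=np.float32), v32)
True
>>> s64 = ' '.join('%.17g' % x for x in v64)
>>> np.array_equal(np.array(s64.split(), dtype=np.float64), v64)
True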

If you want to minimise the file size, writing the byte representations would produce a much smaller file, so if you can do that, that's the way to go.
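
For illustration, the binary route is only a couple of lines with NumPy (a sketch; tobytes is spelled tostring in older NumPy versions):

>>> raw = v32.tobytes()  # 4 bytes per float32, 20 bytes for the whole array
>>> np.array_equal(np.frombuffer(raw, dtype=np.float32), v32)
True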

10 Comments

Can't use binary unfortunately. Using a file specification that has numerical data as text (COLLADA)
It seems like .9 just does what repr does, so I guess I have to suffer with values like 0.100000001 :(
Depending on how much work you're willing to invest, you could write a function that produces the shortest representation that parses back to the original value. If you're interested, there's a paper by Burger and Dybvig ("Printing Floating-Point Numbers Quickly and Accurately").
I'd definitely be interested, although I need this to be fast too, so if I have to call a Python function for each value, it might be too slow for my needs
I don't know how fast repr is, but it's probably implemented in C and faster. The Burger/Dybvig algorithm isn't too fast; it's much quicker to produce a representation if you know in advance how many significant digits you want.

The C code that implements the fancy repr in 2.7 is mostly in Python/dtoa.c (with wrappers in Python/pystrtod.c and Objects/floatobject.c). In particular, look at _Py_dg_dtoa. It should be possible to borrow this code and modify it to work with float instead of double. Then you could wrap it up in an extension module, or just build it as a shared library and call it through ctypes.

Also, note that the source says the implementation is "Inspired by 'How to Print Floating-Point Numbers Accurately' by Guy L. Steele, Jr. and Jon L. White [Proc. ACM SIGPLAN '90, pp. 112-126]". So, you might be able to implement something less flexible and simpler yourself by reading that paper (and whichever of the modifications documented in the dtoa.c comments seem appropriate).
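
For reference, the "less flexible and simpler" version can be tiny in pure Python if speed doesn't matter: keep asking %g for more significant digits until the result parses back to the same float32. A sketch (the function name is made up; per the other answer, 9 digits always suffice):

>>> def shortest_float32(x):
...     x = np.float32(x)
...     for p in range(1, 10):       # 9 significant digits always suffice
...         s = '%.*g' % (p, x)
...         if np.float32(s) == x:   # first precision that roundtrips wins
...             return s
...     return repr(float(x))        # only reached for NaN, which never compares equal
>>> [shortest_float32(x) for x in v32]
['5', '0.1', '2.4', '4.5555553', '12345679']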

Finally, the code is a minor change to code posted by David Gay at AT&T, and used in a number of other libraries (NSPR, etc.), one of which might have a more accessible version.

But before doing any of that, make sure there really is a performance issue by trying a Python function and measuring whether it's too slow.

And if this really is a performance-critical area, you probably don't want to loop over the list and call repr (or your own fancy C function) in the first place; you probably want a function that converts a numpy array of floats or doubles to a string all at once. (Ideally you'd want to build that into numpy, of course.)
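
For example, NumPy's own text writer already accepts a C-style format and keeps the per-value loop out of Python. A sketch using the .9g precision from the other answer:

>>> np.savetxt('out.txt', v32.reshape(1, -1), fmt='%.9g')
>>> open('out.txt').read()
'5 0.100000001 2.4000001 4.55555534 12345679\n'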

One last thought: you're looking for "at least really close to not losing precision". It's conceivable that just converting to double and using the repr is close enough for your purposes, and it's obviously much easier than anything else, so you should at least test it to rule it out.

Needless to say, you should also test whether %.9g and %.17g are close enough for your purposes, since that's the next easiest thing that could possibly work.
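
Measuring is cheap, for instance (the array size here is arbitrary):

>>> import timeit
>>> big = np.random.rand(100000).astype(np.float32)
>>> timeit.timeit(lambda: ' '.join('%.9g' % x for x in big), number=10)
>>> timeit.timeit(lambda: ' '.join(map(repr, map(float, big))), number=10)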
