Newest 'floating-point' Questions

2 votes

1 answer

132 views

Accurate computation of the inverse gamma function with the standard C math library

The inverse of the gamma function over the reals is multivalued with an infinite number of branches. This self-answered question is about the principal inverse of the gamma function, Γ0-1(x), whose ...

njuffa

27.1k

asked yesterday

5 votes

3 answers

163 views

PHP 8.4 rounding gives different result than 8.2

I am in the process of upgrading our code from PHP 8.2 to 8.4. I noticed we are getting some test failures because round() returns different values than expected. Ultimately the problem could be ...

yeaitsme

131

asked Nov 11 at 13:18

4 votes

1 answer

108 views

Invalid Operation with Arm64 fcmp and simd

Consider the following snippet: ldr q0, [x0] cmeq v0.16b, v0.16b, #0 shrn v0.8b, v0.8h, #4 fcmp d0, #0.0 This is a common way to implement functions such as strlen with SIMD. According to the Arm64 ...

alexisrdt

524

asked Nov 7 at 7:27

Advice

1 vote

10 replies

164 views

What is the minimum range of the quotient assigned by `remquo()`?

Is the minimum range that remquo(x, y, quo) assigns *quo [-7 ... +7]? If not, what is the minimum compliant range? double remquo(double x, double y, int *quo); has the following description: The ...

chux

158k

asked Nov 4 at 22:18

16 votes

2 answers

696 views

Parsing of small floats with std::istream

I have a program that reads the coordinates of some points from a text file using std::istringstream, and then it verifies the correctness of parsing by calling stream's operator bool(). In general it ...

Fedor

24.7k

asked Oct 31 at 14:23

-4 votes

3 answers

296 views

the pow() function isn't accurate

I read "Calculating very large exponents in python", and Ignacio Vazquez-Abrams's answer was to use the pow() function. So I ran the following command in Python 3.9.6: print(int(pow(1.5,96))) and ...

Lhachimi

119

asked Oct 25 at 22:54

2 votes

1 answer

244 views

Why does Newton’s method overshoot on the first deceleration step in my motion profile generator?

I’m porting a Python motion profile generator to C to implement for my STM32H743. The generator produces step timings for a simple acceleration → cruise → deceleration motion profile. See the ...

Marvin W

43

asked Sep 22 at 20:22

0 votes

1 answer

295 views

Inaccuracy replicating Fortran mixed-precision expression in Rust

I have the following code in my Fortran program, where both a and b are declared as REAL (KIND=8): a = 0.12497443596150659d0 b = 1.0 + 0.00737 * a This yields b as 1.0009210615647672 For comparison, ...

sgfw

356

asked Sep 9 at 0:27

4 votes

1 answer

169 views

Weird behavior in large complex128 NumPy arrays, imaginary part only [closed]

I'm working on numerical simulations. I ran into an issue with large NumPy arrays (~ 26 GB) on Linux with 128 GB of RAM. The arrays are of type complex128. Arrays are instantiated without errors (if ...

laserpropsims

71

asked Sep 6 at 1:52

0 votes

2 answers

248 views

turn Python float argument into numpy array, keep array argument the same

I have a simple function that is math-like: def y(x): return x**2 I know the operation x**2 will return a numpy array if supplied a numpy array and a float if supplied a float. For more complicated ...

villaa

1,259

asked Sep 4 at 17:04

25 votes

13 answers

3k views

How can I parse a string to a float in C in a way that isn't affected by the current locale?

I'm writing a program where I need to parse some configuration files in addition to user input from a graphical user interface. In particular, I'm having issues with parsing strings taken from the ...

Newbyte

3,915

asked Aug 24 at 9:50

2 votes

0 answers

206 views

Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?

I am measuring the latency of instructions. For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...

Zack Light

362

asked Aug 22 at 5:35

4 votes

2 answers

311 views

Why does adding a value to Float.MAX_VALUE not reach infinity?

According to the standard, overflow in Java is handled using a special value called infinity, but here the sum is 3.4028235E38. Why is this the case? public class FloatingPointTest { public static ...

saul goodman

43

asked Aug 17 at 13:32
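
The effect described in the excerpt above can be reproduced with 64-bit doubles in Python. This is only a sketch of the rounding rule, not the Java program from the question: it assumes IEEE 754 round-to-nearest-even arithmetic and Python 3.9 or later (for math.ulp), and the variable names are made up for illustration. An addend smaller than half the spacing at the largest finite value is rounded away, so the sum stays finite.

```python
import math
import sys

largest = sys.float_info.max          # largest finite double
spacing = math.ulp(largest)           # gap between largest and the previous double

small_sum = largest + 1.0             # 1.0 is far below half the spacing
quarter_sum = largest + spacing / 4   # still below the halfway point
half_sum = largest + spacing / 2      # reaches the halfway point and overflows

print(small_sum == largest)
print(quarter_sum == largest)
print(math.isinf(half_sum))
```

The same reasoning should carry over to 32-bit float, where the spacing at Float.MAX_VALUE is much larger than 1.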

6 votes

1 answer

260 views

Speeding up integer division with doubles

I have a fixed-point math-heavy project and I was looking to speed up integer divisions. I tested double division with SSE4 and AVX2 and got nearly 2x speedup versus scalar integer division. I wonder ...

M.kazem Akhgary

19.3k

asked Aug 8 at 21:48

0 votes

1 answer

150 views

GCC offers a _Float16 type, but - what about the functions to work with it?

GCC offers a 16-bit floating point type, outside of the C language standard: _Float16 - at least for x86_64. This allowance is described here. However - the GCC documentation does not seem to indicate ...

einpoklum

137k

asked Aug 5 at 16:33

5 votes

1 answer

156 views

Is it expected that vmapping over different input sizes for the same function impacts the accuracy of the result?

I was surprised to see that depending on the size of an input matrix, which is vmapped over inside of a function, the output of the function changes slightly. That is, not only does the size of the ...

hvater

100

asked Aug 5 at 12:17

3 votes

2 answers

194 views

How does Oracle convert decimal values to float?

If I have a float(5) column, why does 7.89 get rounded to 7.9 but 12.79 gets rounded to 13, not 12.8? Binary forms are as follows for 3 examples: 7.89 0111.01011001 ------ round to ------> 7.9 ...

titi zarif

31

asked Aug 5 at 0:14

0 votes

1 answer

221 views

How can a long double be that big in C++? [duplicate]

The sizeof(long double) is 8, which means that if I use all the bits for the integer part of an unsigned number, the maximum I can store is 2^64-1=18446744073709551615. However, std::numeric_limits<long ...

alekscooper

861

asked Aug 4 at 12:58

0 votes

0 answers

135 views

Previous representable floating point value with `ffast-math`

I am binary searching an array of floating points. In order to modify the result of a binary search for certain values, I insert previous representable floating point values. I started with: std::...

Denis Yaroshevskiy

1,507

asked Jul 30 at 13:39

1 vote

1 answer

106 views

integer exact computation with logs

I need to compute ceil(log_N(i)) where log_N is the log with positive integer base N and i is also a positive integer. The straight forward python implementation using floating point math fails at ...

user1816847

2,148

asked Jul 29 at 22:22
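
For the integer-logarithm question above, one way to sidestep floating-point rounding altogether is to stay in integer arithmetic. The following is a minimal sketch under the assumption that i >= 1 and the base is at least 2; the function name ceil_log is hypothetical and not taken from the question or its answers.

```python
def ceil_log(i: int, base: int) -> int:
    """Return the smallest k >= 0 such that base**k >= i, using integers only."""
    if i < 1 or base < 2:
        raise ValueError("requires i >= 1 and base >= 2")
    k = 0
    power = 1
    # Multiply up until the power reaches or passes i; Python integers never overflow.
    while power < i:
        power *= base
        k += 1
    return k
```

The loop runs about log_base(i) times, which is usually cheap; the point is that every comparison is exact, so results at exact powers of the base cannot be off by one.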

53 votes

2 answers

3k views

Why is 0.0 printed as 0.00001 when rounding upward?

If in a C++ program, I activate upward rounding mode for floating-point numbers and print some double-precision value already rounded to an integer, e.g.: #include <cfenv> #include <iostream>...

Fedor

24.7k

asked Jul 24 at 18:38

1 vote

2 answers

158 views

Can I tell my compiler that a floating-point value is not NaN nor +/-infinity?

I'm writing a C function float foo(float x) which manipulates a floating-point value. It so happens, that I can guarantee the function will only ever be called with finite values - neither NaN's, nor +...

einpoklum

137k

asked Jul 20 at 15:18

0 votes

1 answer

118 views

Why does NVCC not optimize ldexpf with a constexpr power-of-two exponent into a simple fmul?

Consider the following CUDA code: enum { p = 5 }; __device__ float adjust_mul(float x) { return x * (1 << p); } __device__ float adjust_ldexpf(float x) { return ldexpf(x, p); } I would expect ...

einpoklum

137k

asked Jul 20 at 15:09

1 vote

1 answer

123 views

Problems with printing double constants in c

I have code that looks like this: #include <float.h> #include <stdio.h> int main(void) { long double x = LDBL_MAX; printf("%Lf ", x); printf("%Lf\n", 1....

leaner18932

23

asked Jul 20 at 4:58

0 votes

3 answers

120 views

Is there a known solution for converting IEEE Float values to Hexadecimal in Ada without using the IEEE package?

I do not have the ability to update our Ada compiler set to include the IEEE packages. Is there a way to convert a Float into a Hexadecimal integer? For instance, a Float value of 1.5 as input should ...

Stacey Robert Greenstein

11

asked Jul 16 at 19:57

0 votes

1 answer

253 views

GCC -Wunsuffixed-float behaviour

(major edit of the question) I'm using arm-none-eabi-gcc 10.2. The following code const double d = 1.0; int main() { return d*d; } compiled with arm-none-eabi-gcc -Wall -Wextra -Wpedantic -...

Guillaume Petitjean

2,798

asked Jul 8 at 7:51

10 votes

4 answers

310 views

Why does this lookup table sine estimation perform worse when using float instead of double?

I've written a simple sine estimation function which uses a lookup table. Out of curiosity, I tried both float and double types, expecting float to perform a bit better because of being able to pack ...

multitaskPro

723

asked Jul 7 at 2:38

2 votes

2 answers

264 views

How to get integer unix timestamps in python? Not casting float to integer

The float timestamp used in Python seems to be problematic. First, it will lose its precision as time goes on, and second, the timestamp is not usable, for example, storing in a database big integer ...

Yadav Dhakal

45

asked Jul 4 at 11:44
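
As an illustration of the timestamp question above: the standard library's time.time_ns() (available since Python 3.7) returns the time since the epoch as an integer number of nanoseconds, so whole seconds can be obtained without passing through a float. The helper name unix_seconds is an assumption made for this sketch.

```python
import time

def unix_seconds() -> int:
    """Current Unix time in whole seconds, computed with integer arithmetic only."""
    # time_ns() is already an int; floor division keeps it an int.
    return time.time_ns() // 1_000_000_000

print(unix_seconds())
```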

17 votes

1 answer

896 views

Performance of two inverse square root expressions (accounting for CPU pipelining)

To compute the inverse of the square root of a double float in C, the most natural code would be the following: 1 / sqrt(x) (sqrt being a math.h function). At high optimization levels, I expected that ...

Oscar Belletti

193

asked Jul 1 at 19:48

2 votes

1 answer

183 views

Fast bithacked log2 approximation

In order to estimate entropy of a sequence, I came up with something like this: float log2_dirty(float x) { return std::bitcast<float>((std::bitcast<int>(x) >> 4) + 0x3D800000)) -...

Aki Suihkonen

20.5k

asked Jun 26 at 7:58

-2 votes

2 answers

231 views

Taylor Series in Java using float

I am implementing a method berechneCosinus(float x, int ordnung) in Java that should approximate the cosine of x using the Taylor series up to the specified order. My questions: Can I optimize the ...

Andre

21

asked Jun 21 at 19:14

3 votes

2 answers

178 views

Efficiently find the minimum exponent to eliminate the fractional part of a floating point number?

Given a float or a double, I want to find the smallest exponent e such that n*pow(2,e) is an integer. I've currently got this code that just loops multiplying by 2 until the fractional part disappears:...

Logan R. Kearsley

930

asked Jun 18 at 16:02
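
One loop-free way to approach the exponent question above is to read the exponent off the exact binary fraction of the value. The sketch below is in Python rather than the asker's language and rests on two assumptions: the input is finite and nonzero, and a negative e is acceptable for integer-valued inputs with trailing zero bits. The name min_exponent is hypothetical.

```python
import math

def min_exponent(x: float) -> int:
    """Smallest e such that x * 2**e is an integer (x finite and nonzero)."""
    if x == 0 or not math.isfinite(x):
        raise ValueError("x must be finite and nonzero")
    # as_integer_ratio() returns the exact value as num/den in lowest terms;
    # for a binary float the denominator is always a power of two.
    num, den = x.as_integer_ratio()
    if den > 1:
        return den.bit_length() - 1
    # Integer-valued input: each trailing zero bit allows one more division by 2.
    n = abs(num)
    lowest_set_bit = n & -n
    return -(lowest_set_bit.bit_length() - 1)
```

For example, 0.75 is exactly 3/4, so the function returns 2; 8.0 returns -3 because 8 * 2**-3 is 1.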

2 votes

1 answer

206 views

Comparing initialized floating point variable to zero, again

We know that floating point values cannot be compared with the == operator, due to precision issues. However, the following code, which initializes a double variable to an integer 0, successfully ...

Pietro

13.5k

asked Jun 16 at 11:29

4 votes

3 answers

174 views

How to calculate largest float smaller than input integer in C#?

I need a C# function that takes in an integer and returns a float, which is the largest float smaller than that integer. In normal (not floating point) math, this is an example: For input of 5 the ...

Laserna Tunika

43

asked Jun 12 at 11:59

4 votes

0 answers

185 views

Questions regarding the new C++ fixed-width floating-point types in C++23

With the introduction of the new fixed-width floating point types in C++23 (the std::floatN_t types in <stdfloat>), some guarantees regarding the representation of floating-point types are not ...

alecov

5,262

asked May 30 at 21:54

0 votes

4 answers

264 views

Rounding floats / decimals in e-commerce (PHP 8.3 and Laravel 11)

So due to the fact that product quantities are not limited to integers but can store decimals up to 4 digits, I can have a product with quantity of 69.4160 square meters of something. Now assuming the ...

Matt Komarnicki

5,472

asked May 14 at 10:26

0 votes

1 answer

193 views

c# display full float value

I know that binary storage for floats cannot always store the exact value. I know that .NET Framework only displays up to 7 significant digits for floats, and uses the 8th to round the 7th. I also ...

Jeff Shepler

2,141

asked May 11 at 22:01

4 votes

1 answer

134 views

floating-point implementation of a function with removable singularity

Consider the following function: ƒ(x) = sin(sqrt(x)) / sqrt(x) if x > 0 ƒ(x) = 1 if x = 0 ƒ(x) = sinh(sqrt(-x)) / sqrt(-x) if x < 0 This is an entire function with a Taylor series at 0 of: 1/1!...

emacs drives me nuts

4,337

asked May 10 at 14:23

3 votes

2 answers

178 views

Why does sqrt(x+1)-sqrt(x) result to 0 in JS? [duplicate]

I’m trying to compute the difference between two square roots in JS. function deltaSqrt(x) { return Math.sqrt(x + 1) - Math.sqrt(x); } console.log(deltaSqrt(1e6)); // 0.0004999998750463419 ...

Ben Wang

39

asked May 9 at 19:33
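
The cancellation behind the square-root question above can be shown with a short sketch. It is written in Python rather than JavaScript, but both use IEEE 754 doubles. The rewritten form multiplies by the conjugate, so it adds two nearby values instead of subtracting them; the function names are made up for illustration.

```python
import math

def delta_sqrt_naive(x: float) -> float:
    # Subtracts two nearly equal values, so most significant digits cancel.
    return math.sqrt(x + 1) - math.sqrt(x)

def delta_sqrt_stable(x: float) -> float:
    # Algebraically identical: (sqrt(x+1) - sqrt(x)) * (sqrt(x+1) + sqrt(x)) == 1.
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(delta_sqrt_naive(1e16))
print(delta_sqrt_stable(1e16))
```

At x = 1e16 the spacing between adjacent doubles is 2, so x + 1 rounds back to x and the naive form has nothing left to subtract, while the stable form still returns a value close to 1 / (2 * sqrt(x)).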

0 votes

2 answers

90 views

Why does the output is 4 seconds, but real time of work is less? [closed]

I have a simple decorator to measure time of function in seconds from functools import wraps from time import time def time_it(func): @wraps(func) def wrapper(*args, **kwargs): ...

mascai

1,718

asked May 3 at 15:00

8 votes

2 answers

730 views

Sine approximation, did i beat Remez?

First, it is the most compact sine approximation ever. It seems I can do better than Remez in terms of precision/performance. Here the approximation range is [0, pi/2]. double p[] = -0.0020836519336891552, -0....

minorlogic

2,055

asked May 1 at 12:22

0 votes

0 answers

106 views

In C#, why are 0.025f and 0.025 equal to 0.025m? [duplicate]

In C#, I have this code: float initialResolutionFloat = 0.025f; double initialResolutionDouble = 0.025; decimal initialResolutionDecimal = 0.025m; decimal resolutionFloat = (decimal)...

Gilles jr Bisson

533

asked Apr 30 at 17:16

0 votes

0 answers

74 views

How to print the value of a floating point register using radare2

I'm debugging an executable program with radare2 on an armv8 computer, and I don't know how to print the values in the floating-point registers. I tried using the dr directive, but the output is wrong....

XuezhuJiang

1

asked Apr 30 at 2:13

0 votes

2 answers

90 views

How do I use g_assert_cmpfloat () to check if two floats are equal without generating a safety warning?

I'm writing some testcases for a program using the GLib testing facilities. I want to assert that two floats have the same value, and g_assert_cmpfloat () seemed like an appropriate function for that. ...

Newbyte

3,915

asked Apr 29 at 21:13

0 votes

1 answer

108 views

How to tell ARM clang to emit SCVTF with the #fbits parameter?

In an embedded application we need to convert a signed-fractional number (S1.31 format) to a single-precision floating point number. A C function looks like this: #include <stdint.h> float ...

ysap

8,241

asked Apr 24 at 2:55

3 votes

1 answer

223 views

How do I convert a random uint32_t into a random float in the interval [0, 1) using the method from Xoshiro documentation?

On the Xoshiro webpage, they have the following statement under the heading "Generating uniform doubles in the unit interval": A standard double (64-bit) floating-point number in IEEE ...

Sasha

31

asked Apr 18 at 16:32
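
The excerpt above is cut off, so the following is only a guess at the 32-bit analogue of the construction it quotes: keep the top 24 bits of the integer (the significand width of an IEEE 754 single) and scale by 2**-24. The sketch is in Python, the function names are hypothetical, and the struct round trip is there only to check that each result is exactly representable as a 32-bit float.

```python
import struct

def u32_to_unit_float(x: int) -> float:
    """Map a 32-bit unsigned integer to a value in [0, 1) using its top 24 bits."""
    if not 0 <= x <= 0xFFFFFFFF:
        raise ValueError("x must fit in 32 unsigned bits")
    # (x >> 8) is below 2**24, so the product is exact and strictly below 1.
    return (x >> 8) * 2.0 ** -24

def fits_float32(value: float) -> bool:
    """True if value survives a round trip through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0] == value

print(u32_to_unit_float(0xFFFFFFFF))
print(fits_float32(u32_to_unit_float(0xFFFFFFFF)))
```

Dropping the low 8 bits is a design choice: converting all 32 bits and dividing by 2**32 could round up to exactly 1.0 in single precision, which would break the half-open interval.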

4 votes

5 answers

313 views

Why no floating point error occurs in print(0.1* 100000) vs (Decimal(0.1)*100000) due to FP representation of 0.1?

I am studying numerical analysis and I have come across this dilemma. Running the following script, from decimal import Decimal a = 0.1 ; N = 100000 ; # product calculation P = N*a # Print product ...

Nicola Sergio

91

asked Apr 17 at 14:14
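
The comparison in the question above can be made concrete with exact rational arithmetic. This sketch reuses the a and N from the excerpt; every other name is made up. It assumes IEEE 754 doubles and the default 28-digit Decimal context.

```python
from decimal import Decimal
from fractions import Fraction

a = 0.1
N = 100000

exact_a = Fraction(a)              # exact rational value stored in the double
exact_product = exact_a * N        # exact product, no rounding at all
float_product = a * N              # double product, rounded once to the nearest double
decimal_product = Decimal(a) * N   # Decimal sees the stored value, not 1/10

print(exact_a == Fraction(1, 10))
print(float_product == 10000.0)
print(decimal_product == Decimal(10000))
```

The stored value of 0.1 is slightly above 1/10, so the exact product is slightly above 10000. The excess is smaller than half the spacing between doubles near 10000, so the float product rounds back to exactly 10000.0, while Decimal keeps enough digits to show the difference.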

3 votes

3 answers

181 views

For a float number a.bc without underflow and overflow, does a.bc.toString() always =="a.bc"? May it return something like "a.bc00000000000000000001"?

For example, I know 0.12 is not exactly equal to 12/100 in decimal because it needs to be rounded to some value. However, I'm not asking about the mathematical problem, but about the JavaScript spec that ...

wcminipgasker2023

221

asked Apr 17 at 3:07

5 votes

2 answers

168 views

Is the Java float to double cast lossy?

I am creating a float value using raw int bits where the mantissa has all bits set to 1. However, casting to double flips the final bit to 0. Why does this happen? From what I've been seeing online, ...

Harsh Motwani

126

asked Apr 15 at 22:37

0 votes

1 answer

83 views

Fortran selected_real_kind and MKL double precision

Following this best practices guide, I have a module in my Fortran code that defines a double precision type, module kind_parameters implicit none public ! Double precision real numbers, ...

Jasper

199

asked Apr 15 at 7:29
